Positive selection distorts the structure of genealogies and hence alters patterns of genetic variation
within a population. Most analyses of these distortions focus on the signatures of hitchhiking due to 
hard or soft selective sweeps at a single genetic locus. However, in linked regions of rapidly adapt- 
ing genomes, multiple beneficial mutations at different loci can segregate simultaneously within the 
population, an effect known as clonal interference. This creates a subtle interplay between hitchhiking
and interference effects, which leaves a unique signature of rapid adaptation on genetic
variation both at the selected sites and at linked neutral loci. Here, we introduce an effective coalescent
theory (a "fitness-class coalescent") that describes how positive selection at many perfectly
linked sites alters the structure of genealogies. We use this theory to calculate several simple statistics
describing genetic variation within a rapidly adapting population, and to implement efficient
backwards-time coalescent simulations which can be used to predict how clonal interference alters 
the expected patterns of molecular evolution. 



I. INTRODUCTION 

Beneficial mutations drive long-term evolutionary 
adaptation, and despite their rarity they can dramati- 
cally alter the patterns of genetic diversity at linked sites. 
Extensive work has been devoted to characterizing these 
signatures in patterns of molecular evolution, and using 
them to infer which mutations have driven past adapta- 
tion. 

When beneficial mutations are rare and selection is 
strong, adaptation progresses via a series of selective 
sweeps. A single new beneficial mutation occurs in a 
single genetic background, and increases rapidly in fre- 
quency towards fixation. This is known as a "hard" se- 
lective sweep, and it purges genetic variation at linked 
sites and shortens coalescence times near the selected
locus [1]. Most statistical methods used to detect signals
of adaptation in genomic scans are based on looking for
signatures of these hard sweeps [2-6].

Hard selective sweeps are the primary mode of adap- 
tation in small to moderate sized populations in which 
beneficial mutations are sufficiently rare. However, in 
larger populations where beneficial mutations occur more 
frequently, many different mutant lineages can segregate 
simultaneously in the population. If the loci involved are 
sufficiently distant that recombination occurs frequently 
enough between them, their fates are independent and 
adaptation will proceed via independent hard sweeps at 
each locus. However, in largely asexual organisms such 
as microbes and viruses, and on shorter distance scales 
within sexual genomes, selective sweeps at linked loci can 
overlap and interfere with one another. This is referred 
to as clonal interference, or Hill-Robertson interference 
in sexual organisms [7, 8]. These interference effects can



dramatically change both the evolutionary dynamics of 
adaptation and the signatures of positive selection in
patterns of molecular evolution. We illustrate them
schematically in Fig. 1.

We and others have characterized the evolutionary
dynamics by which a population accumulates beneficial
mutations in the presence of clonal interference [7, 9-13].
Many recent experiments in a variety of different
systems have confirmed that these interference effects are
important in a wide range of laboratory populations of
microbes and viruses [14-18]. These theoretical and
experimental developments have recently been reviewed by
Park et al. [19] and Sniegowski and Gerrish [20].

Although this earlier theoretical work has provided a 
detailed characterization of evolutionary dynamics in the 
presence of clonal interference, it does not make any pre- 
dictions about the patterns of genetic variation within 
an adapting population. In this paper, we address this 
question of how clonal interference alters the structure 
of genealogies, and how this affects patterns of molecular 
evolution both at the sites underlying adaptation and at 
linked neutral sites. This has become particularly
relevant in light of recent advances that now make it possible
to sequence individuals and pooled population samples
from microbial adaptation experiments [18, 21-23].

We note that much recent work in molecular evolution 
and statistical genetics has analyzed related scenarios 
where adaptation involves multiple mutations, motivated 
by recent theoretical work [24-26] and empirical data
from Drosophila [27] and humans [28, 29] that suggest
that simple hard sweeps may be rare. This includes most
notably analysis of the effects of "soft sweeps," where
recurrent beneficial mutations occur at a single locus, or
selection acts on standing variation at this locus [30-32].
Soft sweeps drive multiple genetic backgrounds to moderate
frequencies, leaving several deeper coalescence events
and hence a weaker signature of reduced variation in the
neighborhood of the selected locus than a hard sweep
[33].

FIG. 1. Schematic of the evolutionary dynamics of adaptation. (a) A small population adapts via a sequence of selective
sweeps. (b) In a large rapidly adapting population, multiple beneficial mutations segregate concurrently. Some of these mutant
lineages interfere with each other's fixation, while others hitchhike together.

In contrast to the situation we analyze here, both hard 
and soft sweeps refer to the action of selection at a sin- 
gle locus. We consider instead a case more analogous to 
models in quantitative genetics, where selection acts on a 
large number of loci that all affect fitness. In other words, 
our analysis of clonal interference can be thought of as a 
description of polygenic adaptation, where selection
favors the individuals who have beneficial alleles at
multiple loci. Recent work has argued for the potential
importance of polygenic adaptation from standing genetic
variation [6, 34], loosely analogous to the case where soft
sweeps act at many loci simultaneously [35, 36]. Our
analysis in this paper, by contrast, describes polygenic 
adaptation via multiple new mutations of similar effect 
at many loci, where each locus has a low enough mu- 
tation rate that it would undergo a hard sweep in the 
absence of the other loci. 

As with hard and soft sweeps, the signatures of this 



form of adaptation on nearby genomic regions are de- 
termined by how it alters the structure and timing of 
coalescence events. In this paper, we therefore focus on 
computing how clonal interference alters the structure of 
genealogies. This involves two basic effects. On the one 
hand, mutations at the many loci occur and segregate
simultaneously, interfering with each other's fixation. This
preserves some deeper coalescence events, as in a soft 
sweep. On the other hand, since the mutations occur 
at different sites, multiple beneficial mutations can also 
occur in the same genetic background and hitchhike to- 
gether. This tends to shorten coalescence times, making 
the signature of adaptation somewhat more like a "hard 
sweep." Together, these effects lead to unique patterns 
of genetic diversity characteristic of clonal interference. 

Our analysis of these effects is based on the fitness-class
coalescent we previously used to describe the effects
of purifying selection on the structure of genealogies [37].
This in turn is closely related to the structured coalescent
model of Hudson and Kaplan [38]. We begin in the
next section by describing our model, and summarize our 
earlier analysis of the rate and dynamics of adaptation 






in the presence of clonal interference, which describes the 
distribution of fitnesses within the population [11]. We
then show how one can trace the ancestry of individuals 
as they "move" between different fitness classes via mu- 
tations (our fitness-class coalescent approach). We com- 
pute the probability that any set of individuals coalesce 
when they are within the same fitness class. This leads to 
a description of the probability of any possible genealog- 
ical relationship between a sample of individuals from 
the population. Finally, we show how the distortions in 
genealogical structure caused by clonal interference alter 
the distributions of simple statistics describing genetic 
variation at the selected loci as well as linked neutral loci. 
We also use our approach to implement coalescent sim- 
ulations analogous to those previously used to describe 
the action of purifying selection [39, 40], based on the
structured coalescent method of Hudson and Kaplan [38].
These coalescent simulations can be used to analyze in 
detail how this form of selection alters the structure of 
genealogies. 

Our results provide a theoretical framework for under- 
standing the patterns of genetic diversity within rapidly 
evolving experimental microbial populations. Our anal- 
ysis may also have relevance for understanding how per- 
vasive positive selection alters patterns of molecular evo- 
lution more generally, but we emphasize that our work 
here focuses entirely on asexual populations or on diver- 
sity within a short genomic region that remains perfectly 
linked over the relevant timescales. In the opposite case 
of strong recombination, adaptation will progress via in- 
dependent hard selective sweeps at each selected locus. 
Further work is required to understand the effects of in- 
termediate levels of recombination, where the approach 
recently introduced by Neher et al. [41] may provide a
useful starting point. 



II. MODEL AND EVOLUTIONARY DYNAMICS 
A. Model 

We consider a finite haploid asexual population of
constant size N in which a large number of beneficial
mutations are available, each of which increases fitness by
the same amount s. We define $U_b$ as the total mutation
rate to these mutations. We neglect deleterious
mutations and beneficial mutations with other selective
advantages. We have previously shown that the dynamics
in rapidly adapting populations are dominated by
beneficial mutations of a specific fitness effect [11, 13, 42],
so this model is a useful starting point, but we return
to discuss these assumptions further in the Discussion.
We also assume that there is no epistasis for fitness, so
the fitness of an individual with k beneficial mutations
is $w_k = (1 + s)^k \approx 1 + sk$. This is the same model
of adaptation we have previously considered [11] and is



largely equivalent to models used in most related theo- 
retical work on clonal interference [101 O US] • We will 
later also consider linked neutral sites with total mutation
rate $U_n$, but for now we focus on the structure of
genealogies and neglect neutral mutations. 

To analyze expected patterns of genetic variation, we 
must also make specific assumptions about how muta- 
tions occur at particular sites. We will consider a per- 
fectly linked genomic region which has a total of B loci at 
which beneficial mutations can occur. We assume these
mutations occur at rate $\mu$ per locus, for a total beneficial
mutation rate $U_b = \mu B$. We will later take the
infinite-sites limit, $B \to \infty$, while keeping the overall beneficial
mutation rate $U_b$ constant. Each mutation is assumed
to confer the same fitness advantage s, where $s \ll 1$.
We will also assume throughout that selection is strong
compared to mutations, $s \gg U_b$, which allows us to use
our earlier results in Desai and Fisher [11] as a basis for
our analysis. Analysis of the opposite case where $s < U_b$
remains an important topic for future work, which could
be based on alternative models of the dynamics such as
the approach of Hallatschek [12]. Although our model is
defined for haploids, our analysis also applies to diploid 
populations provided that there is no dominance (i.e., be- 
ing homozygous for the beneficial mutation carries twice 
the fitness benefit as being heterozygous). 

This model is the simplest framework that captures 
the effects of positive selection on a large number of inde- 
pendent loci of similar effect. However, the dynamics of 
adaptation in this model can be complex. Beginning from 
a population with no mutations at the selected loci, there 
is first a transient phase while variation at these loci ini- 
tially increases. There is then a steady state phase during 
which the population continuously adapts towards higher 
fitness. Finally, adaptation will eventually slow down as 
the population approaches a well-adapted state. In this 
paper, we focus on the second phase of rapid and con- 
tinuous adaptation, which has been the primary focus of 
previous work by us and others [TTJ [T2j [191 US] • Our goal 
is to understand how this continuous rapid adaptation 
alters the structure of genealogies and hence patterns of 
genetic variation. We begin in the next subsection by 
summarizing the relevant aspects of our earlier results 
for the distribution of fitness within the population. 



B. The distribution of fitness within the population 

In our model in which all beneficial mutations confer
the same advantage, s, the distribution of fitnesses within
the population can be characterized by the fraction of the
population, $\phi_k$, that has k beneficial mutations more or
less than the population average. We refer to this as
"fitness class k."

When N and $U_b$ are small, it is unlikely that a second
beneficial mutation will occur while another is
segregating. Hence adaptation proceeds by a succession of
selective sweeps. In this regime, beneficial mutations
destined to survive drift arise at rate $N U_b s$ and then fix in
$\frac{1}{s} \ln[Ns]$ generations. Thus adaptation will occur by
successive sweeps provided that

\[ N U_b \ll \frac{1}{\ln[Ns]} . \tag{1} \]



When this condition is met, the population is almost
always clonal or nearly clonal except during brief periods
while a selective sweep is occurring. Thus we will have
$\phi_0 = 1$ and $\phi_k = 0$ for $k \neq 0$.

In larger populations, however, new mutations contin- 
uously arise before the older mutants fix. Thus the pop- 
ulation maintains some variation in fitness even while it 
adapts. The distribution of fitnesses within the popula- 
tion is determined by the balance between two effects. 
On the one hand, new mutants arise at the high-fitness 
"nose" of this distribution, generating new mutants more 
fit than any other individuals in the population. This in- 
creases the variation in fitness in the population. (While 
new mutations occur throughout the fitness distribution, 
the mutations essential to maintain variation are those 
that arise at the nose and generate new most-fit indi- 
viduals.) On the other hand, selection destroys less-fit 
variants, increasing the mean fitness and decreasing the 
variation in fitness within the population. This is
illustrated in Fig. 2.

We showed in previous work that this balance between 
mutation and selection leads to a constant steady state 
distribution of fitnesses within the population, measured 
relative to the current (and constantly increasing) mean 
fitness [11]. In this steady state distribution, the fraction
of individuals with k beneficial mutations relative to the
current mean in the population is typically

\[ \phi_k = C e^{-\sum_{i=1}^{k} i s \bar{\tau}} , \tag{2} \]

where $\bar{\tau}$ is defined below and C is an overall
normalization constant that will not matter for our purposes. Note
that the distribution $\phi_k$ is approximately Gaussian.

This distribution $\phi_k$ is cut off above some finite
maximum k which corresponds to the nose of the distribution,
the most-fit class of individuals. We define the lead of
the fitness distribution, qs, as the difference between the
mean fitness and the fitness of these most-fit individuals
(so q is the maximum value of k; the most-fit
individuals have q more beneficial mutations than the average
individual). In Desai and Fisher [11], we showed that

\[ q = \frac{2 \ln[Ns]}{\ln[s/U_b]} . \tag{3} \]

This is illustrated in Fig. 2.

Above we have implicitly defined $\bar{\tau}$ to be the
"establishment time," the average time it takes for new mutations
to establish a new class at the nose of the distribution,

\[ \bar{\tau} = \frac{\ln^2[s/U_b]}{2 s \ln[Ns]} . \tag{4} \]



As we will see below, the characteristic time scale for
coalescent properties will turn out to be the time for the
fitness class at the nose to become the dominant
population, i.e. for the mean fitness to increase by the lead
of the fitness distribution. This takes q establishment
times, so that this "nose-to-mean" time is

\[ T_{nm} = q \bar{\tau} = \frac{\ln(s/U_b)}{s} , \tag{5} \]

which is roughly independent of the population size for
sufficiently large N. We note that no single mutant
sweeps to fixation in this time: rather, a whole set of
mutants comprising a new fitness class at the nose will
come to dominate the population a time $T_{nm}$ later.
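
As a quick numerical illustration of these scales, the following Python sketch evaluates the lead, the establishment time, and the nose-to-mean time for given N, s, and $U_b$. It assumes the expressions $q = 2\ln[Ns]/\ln[s/U_b]$ and $\bar{\tau} = \ln^2[s/U_b]/(2s\ln[Ns])$ from Desai and Fisher [11]. The function name `adaptation_scales`, the dictionary keys, and the use of a plain inequality as a stand-in for the "much less than" condition of successive sweeps are illustrative assumptions, not part of the original analysis.

```python
import math


def adaptation_scales(N, s, U_b):
    """Return the lead q, establishment time tau, and nose-to-mean time T_nm.

    Assumes the strong-selection regime (U_b < s < 1) and N*s > 1, so that
    both logarithms below are positive.
    """
    if not (0.0 < U_b < s < 1.0) or N * s <= 1.0:
        raise ValueError("need 0 < U_b < s < 1 and N * s > 1")
    log_Ns = math.log(N * s)
    log_s_over_Ub = math.log(s / U_b)
    q = 2.0 * log_Ns / log_s_over_Ub                 # lead, in units of s
    tau = log_s_over_Ub ** 2 / (2.0 * s * log_Ns)    # establishment time
    return {
        "q": q,
        "tau": tau,
        "T_nm": q * tau,  # equals ln(s/U_b)/s, independent of N
        # Crude flag only: the text requires N*U_b << 1/ln(Ns), not just <.
        "successive_sweeps": N * U_b < 1.0 / log_Ns,
    }
```

For example, `adaptation_scales(1e7, 0.01, 1e-5)` (hypothetical parameter values) returns a lead of a few mutations, and its `T_nm` entry agrees with ln(s/U_b)/s regardless of the value chosen for N.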



III. THE FITNESS-CLASS COALESCENT 
APPROACH 

We now wish to understand the patterns of genetic 
variation within a rapidly adapting population in the 
clonal interference regime. To do so, we will use a fitness- 
class coalescent method in which we trace how sampled 
individuals descended from individuals in less-fit classes, 
moving between classes by mutation events. In each fit- 
ness class, there will be some probability of coalescence 
events. To calculate these coalescence probabilities, we 
must first understand the clonal structure within each 
fitness class: this we now consider. 



A. Clonal structure 

Each fitness class is first created when a new beneficial 
mutation occurs in the current most-fit class, creating a 
new most-fit class at the nose of the fitness distribution 
(see inset of Fig. 2). This new clonal mutant lineage
fluctuates in size due to the effects of genetic drift and
selection before it eventually either goes extinct or
establishes (i.e. reaches a large enough size that drift becomes
negligible). After establishing, the lineage begins to grow
almost deterministically. Concurrently additional
mutations occur at the nose of the distribution, also founding
new mutant lineages within this most-fit class. This
process is illustrated in Fig. 3.

We wish to understand the frequency distribution of 
these new clonal lineages, each founded by a different 
beneficial mutation. In our infinite-sites model, each such 
lineage is genetically unique. We can gain an intuitive
understanding of this frequency distribution with a
simple heuristic argument. After it establishes, the size of
the current most-fit class, $n_{q-1}(t)$, grows approximately
deterministically according to the formula

\[ n_{q-1}(t) = \frac{1}{qs} e^{(q-1)st} , \tag{6} \]

as we described in Desai and Fisher [11]. New mutations
occur in individuals in this class at rate $U_b n_{q-1}(t)$,
creating even more-fit individuals. Each new mutation has
a probability qs of escaping genetic drift to form a new
established mutant lineage. Thus the $\ell$th established
mutant lineage at the nose will on average occur at roughly
the time $t_\ell$ that satisfies

\[ \int_0^{t_\ell} q s U_b n_{q-1}(t) \, dt = \ell . \tag{7} \]

Solving this for $t_\ell$ and then noting that the size, $n_\ell$, of the
$\ell$th established lineage will be proportional to $e^{qs(t - t_\ell)}$,
we immediately find

\[ n_\ell \propto \ell^{-q/(q-1)} . \tag{8} \]

This provides a good estimate of the typical frequency
distribution of clonal lineages within this fitness class at
the nose, each lineage founded by a single new mutation.

FIG. 2. Schematic of the evolution of large asexual populations, from Desai and Fisher [11]. The fitness distribution within
a population is shown on a logarithmic scale. (a) The population is initially clonal. Beneficial mutations of effect s create a
subpopulation at fitness s, which drifts randomly until it reaches a size of order $1/s$, after which it behaves deterministically. (b)
This subpopulation generates mutations at fitness 2s. Meanwhile, the mean fitness of the population increases, so the initial
clone begins to decline. (c) A steady state is established. In the time it takes for new mutations to arise, the less fit clones die
out and the population moves rightward while maintaining an approximately constant lead from peak to nose, qs (here q = 5).
The inset shows the leading nose of the population.
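
The heuristic argument for the lineage frequencies can be checked numerically. The Python sketch below solves the establishment-time condition under deterministic growth of class q-1 at rate (q-1)s, which gives $t_\ell = \ln[1 + \ell (q-1) s / U_b] / ((q-1)s)$, and then weights each lineage by $e^{-q s t_\ell}$, since an established lineage in the new nose class grows at rate qs. The function name `heuristic_lineage_frequencies` and the truncation at a finite number of lineages are assumptions made for illustration; this is a sketch of the typical (deterministic) picture only and ignores the stochastic effects of anomalously early mutations.

```python
import math


def heuristic_lineage_frequencies(q, s, U_b, n_lineages):
    """Expected establishment times and relative sizes of the first lineages.

    Returns (times, freqs), where times[l-1] is t_l and freqs[l-1] is the
    size of lineage l divided by the total over the first n_lineages.
    """
    if q <= 1 or s <= 0 or U_b <= 0 or n_lineages < 1:
        raise ValueError("need q > 1, s > 0, U_b > 0, n_lineages >= 1")
    growth = (q - 1) * s  # growth rate of the feeding class q-1
    # Solve U_b * (exp(growth * t) - 1) / growth = l for t.
    times = [math.log1p(l * growth / U_b) / growth
             for l in range(1, n_lineages + 1)]
    # Lineage l grows at rate q*s from time t_l, so its size scales as
    # exp(-q*s*t_l) relative to a lineage established at time zero.
    weights = [math.exp(-q * s * t) for t in times]
    total = sum(weights)
    return times, [w / total for w in weights]
```

When $s \gg U_b$ the ratio of the first to the second lineage approaches $2^{q/(q-1)}$, consistent with a power-law decay of lineage size with exponent $q/(q-1)$ in the order of establishment.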

FIG. 3. Schematic of the establishment and fate of clonal
lineages in a given fitness class, shown for a case where q = 3.
(a) Three new clonal lineages (denoted in different colors) are
established at the nose of the fitness distribution by three
independent new mutations. These lineages have relative
frequencies determined by the timing of these mutations. (b)
After the population evolves for some time, the class that
was at the nose of the distribution in (a) is now at the mean
fitness. The class is still dominated by the three clonal
lineages established while the class was at the nose (subsequent
mutations represent only a small correction). These three
clonal lineages have the same relative frequencies as when
they were established at the nose; these relative frequencies
remain "frozen" even as the population adapts.

The analysis above describes the clonal structure
created as a new fitness class is formed, advancing the nose.
After approximately $\bar{\tau}$ generations, the mean fitness of
the population will have increased by s, and the growth
rates of all the fitness classes we have described will
decrease correspondingly. Thus we can strictly only use
the calculations above up to some finite number of
mutations, $\ell_{max}$, after which all growth rates will have
decreased due to the advance of the mean fitness of the
population. Mutations will continue to occur after this
time, but their frequency distribution will be slightly
different. Fortunately, in the strong selection regime we
consider ($s \gg U_b$), the total contribution of all
mutations after this point to the total size of the class is small
compared to the contributions of the mutations that oc- 
cur while this class is at the nose [11] |^ . We therefore 
neglect this cutoff to the number of mutations that occur 
at the nose, as well as the contribution of later muta- 
tions. This approximation will break down for very large 
samples. However, the errors it introduces can be shown 
to be relatively small even when considering quantities 
such as the time to the most recent common ancestor of 
the whole population. 



Another important aspect of the dynamics that sim- 
plifies the behavior is that despite the changing growth 
rate of the fitness class as a whole, the frequencies of 
the established lineages within the class remain fixed. In 
other words, the clonal structure within the class remains
"frozen" after it is initially created, rather than
fluctuating with time (see Fig. 3). As we will see, this and
the neglect of late-arising mutations are good approxi- 
mations in the regimes we consider here. 

While our heuristic analysis provides a good picture 
of the typical frequency distribution of clonal lineages 
within each fitness class, it misses a crucial effect. Occa- 
sionally a new mutation at the nose will, by chance, occur 
anomalously early. This single mutant lineage can then 
dominate its fitness class. These events are quite rare, 
but when they do occur this single lineage can purge a 
substantial fraction of the total genetic diversity within 
the population. As we will see, these events together 
with less-rare but still early mutations are essential to
understanding the structure of genealogies within the
population as they lead to a substantial probability of 
"multiple merger" coalescent events. 

To capture these effects, we must carry out a more 
careful stochastic analysis of the clonal structure within 
each fitness class. As before, we focus on the clonal struc- 
ture created when that class was at the nose of the fitness 
distribution, since it remains "frozen" thereafter. To do 
so, we note that the population size at the nose can be
written as

\[ n_q(t) = n(t) \sum_{i=1}^{B} \nu_i(t) , \tag{9} \]

where $n(t)$ reflects the average growth of all clones due
to selection, and $\nu_i(t)$ reflects the stochastic effects of
a clone generated from mutations at site i (of B total
possible sites). At late enough times, the distribution of
$\nu_i$ becomes time-independent, as shown previously [11].
This time-independent $\nu_i$ summarizes the combined
effect of all the stochastic dynamics of mutations at this
site that are relevant for the long-term dynamics. We
showed that the generating function of $\nu_i$ is



\[ G_i(z) = \langle e^{-z \nu_i} \rangle = \exp\left[ -\frac{1}{B} z^{\alpha} \right] , \tag{10} \]

where angle brackets denote expectation values, the last
equality follows for large B, and we have defined

\[ \alpha = 1 - \frac{1}{q} . \tag{11} \]

The total size of this fitness class is proportional to

\[ \sigma = \sum_{i=1}^{B} \nu_i . \tag{12} \]

This generating function $G_i(z)$ for the size of the clonal
lineage founded at each possible site contains all of the
relevant information about the lineage frequency
distribution, including the stochastic effects described above.
Below we will use it to calculate coalescence probabilities
within our fitness-class coalescent approach, which
we now turn to.



B. Tracing Genealogies 

To calculate the structure of genealogies, we take a
fitness-class approach analogous to the one we used to
analyze the case of purifying selection [37]. We first
consider sampling several individuals from the population.
These individuals come from some set of fitness classes
with probabilities given by the frequencies of those fitness
classes, $\phi_k$. We note that in the purifying selection case,
fluctuations in the $\phi_k$ due to genetic drift were a
potential complication in determining these sampling
probabilities. Here, these fluctuations are much less important
provided that $U_b/s \ll 1$. We note however that
fluctuations in different $\phi_k$ are correlated due to the
stochasticity at the nose. Furthermore, averages of $\phi_k$ are far
larger than their typical values due to rare fluctuations.
Such fluctuations, which we discuss in detail elsewhere
[45], may lead to some slight corrections to our results.
But for most purposes, the typical values of the $\phi_k$ are
what matter: thus we make the simple approximation
that the probability of sampling one individual from class
$k_1$ and a second from class $k_2$ is simply $\phi_{k_1} \phi_{k_2}$, with $\phi_k$
as given in Eq. (2). Analogous formulas apply for larger
samples.

Each sampled individual comes from a specific fitness
class k, and belongs to a specific clonal lineage within
that class. This clonal lineage was created when this
fitness class was at the nose of the distribution,
approximately $(q-k)\bar{\tau}$ generations ago. It was created by a
single new mutation in an individual from what is now
fitness class $k-1$. That individual in turn belonged to
some clonal lineage within class $k-1$, which in turn was
created when that class was at the nose by a new
mutation in an individual from what is now fitness class $k-2$,
and so on.

We now describe the probability of a genealogy relating
a sample of several individuals. Imagine, for simplicity,
that we sampled two individuals that both happened to
be in the same fitness class, k. If these individuals were
from the same clonal lineage within that class, then they
are genetically identical at all the B positively selected
sites. We say they coalesced in class k and did so when
this class was at the nose of the fitness distribution,
approximately $(q-k)\bar{\tau}$ generations in the past. If these
individuals were not from the same clonal lineage within
the class, then they both descended from individuals, in
what is now fitness class $k-1$, that got distinct beneficial
mutations. If the individuals in which these mutations
occurred are from the same clonal lineage within class
$k-1$, we say the sampled individuals coalesced in class
$k-1$. If so, they differ at two of the B positively selected
sites, and coalesced when class $k-1$ was at the nose of
the fitness distribution, approximately $[q-(k-1)]\bar{\tau}$
generations ago. If not, they descended from individuals, in
what is now fitness class $k-2$, that got distinct
beneficial mutations, and so on. We can apply similar logic
to larger samples or when the individuals were sampled
from different fitness classes. We illustrate this
fitness-class coalescent process in Fig. 4.

We note that the probability a sample of individuals 
comes from the same clonal lineage is the same in each 
fitness class, since the clonal structure of the class was 
always determined when that class was at the nose of 
the distribution (nevertheless, conditional on some indi- 
viduals coalescing in a class, the probability of additional 
coalescence events is substantially altered; see below). In 
addition, the coalescence probabilities do not depend on 
when the mutations occurred in the ancestral lineages 
of each sampled individual, since all clonal lineages were 
founded when a class was at the nose of the fitness dis- 
tribution. These are major simplifications compared to 
the case of purifying selection, where the relative timings 
of mutations and the differences in clonal structure in 
different classes are important complications [371 US] • 

To use the fitness-class coalescent approach to calcu- 
late the probability of a given genealogical relationship 
among a sample of individuals from the population, it 
only remains to calculate the probabilities that arbitrary 
subsets of these individuals coalesced within each fitness 
class. In the next section, we use the above described 
clonal structure to compute these fitness-class coales- 
cence probabilities. 
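
The backward-in-class tracing described above can be sketched in a few lines of Python for the simplest case of two individuals sampled from the same class k. The sketch assumes that the pair coalesces independently in each successive class with the same probability, passed in as the hypothetical parameter `p_coal`, and that a coalescence in class c dates to roughly (q - c) establishment times ago, as stated in the text. The function names are illustrative; the choice `p_coal = 1/q` corresponds to the pairwise case of the single-class coalescence probability with $\alpha = 1 - 1/q$.

```python
import random


def sample_coalescence_class(k, p_coal, rng):
    """Walk back from class k, one class per step, until the pair coalesces.

    p_coal is the per-class pairwise coalescence probability (assumed to be
    the same in every class). Returns the class in which coalescence occurs.
    """
    if not 0.0 < p_coal <= 1.0:
        raise ValueError("p_coal must lie in (0, 1]")
    cls = k
    while rng.random() >= p_coal:  # no coalescence in this class
        cls -= 1                   # both ancestors step back one class
    return cls


def coalescence_time(cls, q, tau):
    """Approximate age of a coalescence in class cls: (q - cls) * tau."""
    return (q - cls) * tau
```

Because the number of classes stepped through is geometric, the mean number of steps back is (1 - p_coal)/p_coal, so with p_coal = 1/q a pair sampled from the same class typically coalesces about q - 1 classes further back.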



C. Fitness-class coalescence probabilities 

We begin our calculation of the fitness-class
coalescence probabilities by considering the probability that
H individuals coalesce to 1 in a given class. We call this
probability $D_{H1}$. This coalescence event will occur if and
only if all H of these individuals are members of the same
clonal lineage. The probability an individual is sampled
from a clone of size $\nu$ is $\nu/\sigma$, so summing over all possible
clones we have

\[ D_{H1} = \left\langle \sum_{i=1}^{B} \left( \frac{\nu_i}{\sigma} \right)^{H} \right\rangle , \tag{13} \]

with $\sigma = \sum_i \nu_i$. In Appendix A we use the expression
for the distribution of $\nu$ from Eq. (10), and take the $B \to \infty$
limit, to find

\[ D_{H1} = \frac{\Gamma(H - \alpha)}{\Gamma(H) \, \Gamma(1 - \alpha)} . \tag{14} \]
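
To get a feel for the size of this probability, the following Python sketch evaluates $D_{H1} = \Gamma(H-\alpha)/[\Gamma(H)\,\Gamma(1-\alpha)]$ with $\alpha = 1 - 1/q$, using log-gamma functions for numerical stability. The function name `coalesce_all_probability` is a hypothetical choice made for this illustration.

```python
import math


def coalesce_all_probability(H, q):
    """Probability that H sampled individuals share one clonal lineage.

    Evaluates Gamma(H - alpha) / (Gamma(H) * Gamma(1 - alpha)) with
    alpha = 1 - 1/q, working in log space to avoid overflow for large H.
    """
    if H < 1 or q <= 1:
        raise ValueError("need H >= 1 and q > 1")
    alpha = 1.0 - 1.0 / q
    log_p = (math.lgamma(H - alpha)
             - math.lgamma(H)
             - math.lgamma(1.0 - alpha))
    return math.exp(log_p)
```

For H = 2 the expression reduces to 1 - α = 1/q, so a pair of lineages coalesces in any given class with probability 1/q; for H = 1 it is identically one, as it must be.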

We can use a similar approach to calculate the probabilities of more complicated coalescence configurations.

FIG. 4. Schematic of the fitness-class coalescent process. The distribution of fitnesses within the population is shown (here for a case where the nose is ahead of the mean by $q = 6$ beneficial mutations). Clonal lineages founded by individual beneficial mutations are shown in different colors within each fitness class. Three individuals (A, B, and C) were sampled from the population, from classes $k = 3$, $k = 2$, and $k = 1$ respectively. The ancestors of individuals A and B descended from individuals in the silver lineage in fitness class $k = 0$, and this individual shared a common ancestor with individual C in the gray lineage in class $k = -3$. Individuals A and B differ by 5 beneficial mutations, while individual C differs by 7 beneficial mutations from the common ancestor of B and C. Individuals A and B coalesce when the silver lineage in class $k = 0$ was originally created, which occurred when this class was at the nose of the fitness distribution, $T_{AB} = 6\bar{\tau}$ generations ago. Individuals A, B, and C last shared a common ancestor when the gray lineage in class $k = -3$ was originally created when this class was at the nose of the fitness distribution, $T_{MRCA} = 9\bar{\tau}$ generations ago.



Consider the general situation where $H$ individuals coalesce into $K$ in a given fitness class, with $h_1$ individuals coalescing into lineage 1, $h_2$ individuals coalescing into lineage 2, and so on, up to $h_K$ individuals coalescing into lineage $K$ (note that $\sum_{j=1}^{K} h_j = H$). In Appendix A, we show that this probability, $C_{H,K,\{h_j\}}$, is given by

$$C_{H,K,\{h_j\}} = \frac{H \alpha^{K-1}}{K} \prod_{j=1}^{K} \frac{\Gamma(h_j - \alpha)}{\Gamma(h_j + 1)\,\Gamma(1 - \alpha)}. \tag{15}$$
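As a minimal sketch of how Eq. (15) can be evaluated, the snippet below enumerates every ordered configuration $\{h_j\}$ for a small sample and checks that the probabilities sum to one. It assumes $\alpha = 1 - 1/q$; the function names and the value $q = 4$ are illustrative choices, not part of the original analysis.

```python
from math import gamma

def compositions(total, parts):
    """Yield all ordered tuples of `parts` positive integers summing to `total`."""
    if parts == 1:
        if total >= 1:
            yield (total,)
        return
    for first in range(1, total - parts + 2):
        for rest in compositions(total - first, parts - 1):
            yield (first,) + rest

def config_prob(h, alpha):
    """Eq. (15): probability that H = sum(h) lineages merge into K = len(h)
    ancestral lineages with family sizes h = (h_1, ..., h_K)."""
    H, K = sum(h), len(h)
    prod = 1.0
    for hj in h:
        prod *= gamma(hj - alpha) / (gamma(hj + 1) * gamma(1.0 - alpha))
    return H * alpha ** (K - 1) / K * prod

q = 4.0                 # illustrative lead of the nose (assumed value)
alpha = 1.0 - 1.0 / q   # assumption: alpha = 1 - 1/q

# Four lineages merging into two: one triple merger plus a singleton,
# versus two separate pairwise mergers.
p_triple = config_prob((3, 1), alpha) + config_prob((1, 3), alpha)
p_pairs = config_prob((2, 2), alpha)

# Summing over every ordered configuration and every K should give 1.
total = sum(config_prob(h, alpha)
            for K in range(1, 5) for h in compositions(4, K))
```

For this illustrative $q$, the triple-plus-singleton outcome is more probable than two simultaneous pairwise mergers, which is the kind of topological difference the configuration-level probabilities are needed to resolve.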



In order to compute any quantity that depends on ge- 
nealogical topologies, it will be important to know not 
just that H individuals coalesced into K lineages, but 
that they did so in a specific configuration {hj}. For 
example, if we have four individuals coalescing into two, 
this could occur by three of them coalescing into one and 



the other lineage not coalescing, or alternatively by two 
pairwise coalescence events. These different topologies 
will affect some aspects of molecular evolution such as 
the polymorphism frequency distribution. To compute 
these quantities, we must work with the full coalescence 
probabilities in Eq. (15). 



However, the specific coalescence configurations do not affect non-topology-related quantities such as the total branch length, time to most recent ancestor, or any statistics that depend on these quantities (e.g. the total number of segregating sites $S_n$). To compute the statistics of these aspects of genealogies, we only need to know $H$ and $K$. Thus it will be useful to sum the probabilities of all possible configurations $\{h_j\}$ that lead to a particular $K$. We call this total probability of $H$ individuals coalescing to $K$ lineages $D_{HK}$. We have

$$D_{HK} = \sum_{\{h_j\}} C_{H,K,\{h_j\}}, \tag{16}$$

where the sum over the $\{h_j\}$ is constrained to values such that $\sum_{j=1}^{K} h_j = H$.

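To compute $D_{HK}$, we first make the definition

$$f(H,K) \equiv \sum_{\{h_j\}} \prod_{i=1}^{K} \frac{\alpha\,\Gamma(h_i - \alpha)}{\Gamma(h_i + 1)\,\Gamma(1 - \alpha)}, \tag{17}$$

and note that

$$D_{HK} = \frac{H}{K\alpha}\, f(H,K). \tag{18}$$

We can compute $f(H,K)$ using a simple contour integral,

$$f(H,K) = \frac{1}{2\pi i} \oint \frac{dz}{z^{H+1}} \left[ 1 - (1 - z)^{\alpha} \right]^{K}, \tag{19}$$

where the integral is taken circling the origin. We can alternatively define the generating function for $f(H,K)$,

$$R_f(z) \equiv \sum_{H} f(H,K)\, z^{H}. \tag{20}$$

In Appendix A, we show that

$$R_f(z) = \left[ 1 - (1 - z)^{\alpha} \right]^{K}. \tag{21}$$

We can now compute $f(H,K)$ for arbitrary $H$ and $K$ by noting that

$$f(H,K) = \frac{1}{H!} \left. \frac{d^{H}}{dz^{H}} R_f(z) \right|_{z=0}, \tag{22}$$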



and substitute this into Eq. (18) to compute $D_{HK}$. To give a few examples, we find

$$D_{21} = \frac{1}{q}, \tag{23}$$

$$D_{31} = \frac{1}{2q}\left( 1 + \frac{1}{q} \right), \tag{24}$$

$$D_{32} = \frac{3}{2q}\left( 1 - \frac{1}{q} \right). \tag{25}$$

Taking more derivatives, we can easily make a table of $f(H,K)$ and evaluate any arbitrary $D_{HK}$. We note that in the large $H$ limit, one can directly obtain $f(H,K)$ using saddle point evaluation of the contour integral defined above.
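The table of $D_{HK}$ values can also be built numerically rather than by repeated differentiation. The sketch below is one possible approach, not the authors' implementation: it expands $[1 - (1 - z)^{\alpha}]^{K}$ as a truncated power series, reads off the coefficient of $z^{H}$, and rescales by $H/(K\alpha)$, which is the normalization implied by Eq. (15). It assumes $\alpha = 1 - 1/q$, and the function name is hypothetical.

```python
def merger_probs(H, alpha):
    """Return [D_H1, ..., D_HH] from the power series of [1 - (1 - z)^alpha]^K.

    Uses D_HK = H * f(H, K) / (K * alpha), where f(H, K) is the coefficient
    of z^H in [1 - (1 - z)^alpha]^K (the normalization implied by Eq. (15)).
    """
    # Coefficients a[h] of z^h in 1 - (1 - z)^alpha, built by recurrence.
    a = [0.0] * (H + 1)
    if H >= 1:
        a[1] = alpha
    for h in range(2, H + 1):
        a[h] = a[h - 1] * (h - 1 - alpha) / h

    probs = []
    power = [1.0] + [0.0] * H          # series for the 0-th power
    for K in range(1, H + 1):
        # Multiply the running power by the base series, truncating at z^H.
        new = [0.0] * (H + 1)
        for i, coeff in enumerate(power):
            if coeff == 0.0:
                continue
            for j in range(1, H + 1 - i):
                new[i + j] += coeff * a[j]
        power = new
        probs.append(H * power[H] / (K * alpha))
    return probs

q = 4.0                                  # illustrative value
row3 = merger_probs(3, 1.0 - 1.0 / q)    # [D_31, D_32, D_33]
```

Each row of the resulting table sums to one, which gives a quick sanity check on any implementation.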

Note that the case of rapid adaptation, for which clonal interference is pervasive, corresponds to the case where $q$ is reasonably large (conversely $q = 1$ corresponds to sequential selective sweeps, and our analysis does not apply in this limit). In the large-$q$ regime, $D_{21}$ is small. In neutral coalescent theory, the probability of a three-way coalescence event would then be even smaller: $D_{31} \sim D_{21}^2$. However, this is not the case here: the probability three lineages coalesce is of the same order as the probability two lineages coalesce, $D_{31} \sim D_{21}$, so "multiple-merger" coalescence events are not uncommon. This is a signature of the fact that occasionally a fitness class is dominated by a single large clone, as described above. When this happens, that clone dominates the structure of genealogies, as any ancestral lineages we trace through the fitness distribution are very likely to have originated from this single large lineage, and hence will coalesce within this fitness class. Although these anomalously large clones are rare, they are sufficiently common that they are responsible for a significant fraction of the total coalescence events, and they are responsible for the tendency of genealogies to take on a more "star-like" shape.



IV. GENEALOGIES AND PATTERNS OF 
GENETIC VARIATION 

From the results above for the probabilities of all pos- 
sible coalescence events in each fitness class, we can cal- 
culate the probability of any genealogy relating an arbi- 
trary set of sampled individuals. From these genealogies, 
we can in turn calculate the probability distribution of 
any statistic describing the expected patterns of genetic 
diversity in the sample. 

We begin by neglecting neutral mutations and calculat- 
ing the structure of genealogies in "fitness-class" space. 
That is, we consider individuals sampled from some set 
of fitness classes. We trace their ancestries backwards 
in time as they "advance" from one fitness class to the 
next, via mutational events, and calculate the probability 
that they coalesce in a particular set of earlier-established 
classes. Since each step in the fitness-class coalescent tree 
corresponds to a beneficial mutation, this immediately 
gives us the pattern of genetic diversity at the positively 
selected sites. We later consider how these "fitness-class" 
genealogies correspond to genealogies in real time, and 
use this to derive the expected patterns of diversity at 
linked neutral sites. 



A. The distribution of heterozygosity at positively selected sites



We first describe the simplest possible case, a sample of two individuals. If we sample two individuals at random from the population, the first comes from class $k_1$ and the second from class $k_2$ with probability $\phi_{k_1}\phi_{k_2}$. If these two individuals coalesce in class $\ell$, their total pairwise heterozygosity at positively selected sites, $\pi_b$, will be $(k_1 - \ell) + (k_2 - \ell) = k_1 + k_2 - 2\ell$.

We can now calculate the average $\pi_b$ given $k_1$ and $k_2$, which we write as $\langle \pi^b_{k_1,k_2} \rangle$, by noting that

$$\langle \pi^b_{k_1,k_2} \rangle = |k_2 - k_1| + \langle \pi^b_{k,k} \rangle. \tag{26}$$

By conditioning on whether two individuals sampled from class $k$ coalesce within that class (in which case they have $\pi_b = 0$), we have

$$\langle \pi^b_{k,k} \rangle = 0 \cdot D_{21} + (1 - D_{21}) \left[ \langle \pi^b_{k,k} \rangle + 2 \right], \tag{27}$$

which implies

$$\langle \pi^b_{k,k} \rangle = \frac{2(1 - D_{21})}{D_{21}}. \tag{28}$$

Plugging this into the above, we find

$$\langle \pi^b_{k_1,k_2} \rangle = |k_2 - k_1| + \frac{2(1 - D_{21})}{D_{21}}. \tag{29}$$



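We can now average this over $k_1$ and $k_2$ to find the overall average. Since $k_1$ and $k_2$ are approximately normally distributed with variance $1/(s\bar{\tau})$, their average absolute difference is $\sqrt{4/(\pi s \bar{\tau})}$. Thus we have

$$\langle \pi_b \rangle = \sqrt{\frac{4}{\pi s \bar{\tau}}} + \frac{2(1 - D_{21})}{D_{21}}. \tag{30}$$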
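The mean absolute difference quoted here follows from a standard Gaussian result. As a short worked step, assuming $k_1$ and $k_2$ are independent (our assumption for this check) and each has variance $1/(s\bar{\tau})$:

```latex
\begin{align}
k_1 - k_2 &\sim \mathcal{N}\!\left( 0,\ \frac{2}{s\bar{\tau}} \right), \\
\langle |k_1 - k_2| \rangle
  &= \sqrt{\frac{2}{\pi}} \, \sqrt{\frac{2}{s\bar{\tau}}}
   = \sqrt{\frac{4}{\pi s \bar{\tau}}} .
\end{align}
```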



Note that for large $q$ the second term (corresponding to heterozygosity between individuals sampled from the same class) is approximately $2q$, while the first term is approximately $\sqrt{4q/[\pi \log(s/U_b)]}$, which is smaller by a factor of $1/\sqrt{2\pi \log(Ns)}$. This is because most individuals are much closer to the mean than to the nose, so that $|k_1 - k_2| \ll q$. In other words, a rough but very simple approximation is to assume that all individuals are sampled from the mean fitness class.

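We can use a similar approach to compute the full probability distribution of $\pi_b$. We have

$$P(\pi^b_{k,k} = \gamma) = D_{21}\,\delta_{\gamma,0} + (1 - D_{21})\,P(\pi^b_{k,k} = \gamma - 2), \tag{31}$$

which implies that

$$P(\pi^b_{k,k} = \gamma) = \begin{cases} D_{21}(1 - D_{21})^{\gamma/2} & \text{for } \gamma \text{ even} \\ 0 & \text{for } \gamma \text{ odd.} \end{cases} \tag{32}$$

We can then write the more general result (labeling the individuals so that $k_1 \geq k_2$)

$$P(\pi^b_{k_1,k_2} = \gamma) = D_{21}\,\delta_{\gamma,k_1 - k_2} + (1 - D_{21})\,P(\pi^b_{k_1,k_2} = \gamma - 2), \tag{33}$$

from which we find

$$P(\pi^b_{k_1,k_2} = \gamma) = \begin{cases} D_{21}(1 - D_{21})^{[\gamma - (k_1 - k_2)]/2} & \text{for } \gamma - (k_1 - k_2) \text{ even and } \gamma \geq k_1 - k_2 \\ 0 & \text{otherwise.} \end{cases} \tag{34}$$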
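The conditional distribution in Eq. (34) is a shifted geometric distribution on every second integer, which is simple to tabulate. The following sketch assumes $D_{21} = 1/q$ and uses illustrative values for the classes and for $q$; the function name is hypothetical.

```python
def pi_b_distribution(k1, k2, q, gamma_max):
    """P(pi_b = gamma | k1, k2) for gamma = 0..gamma_max, following Eq. (34).

    Assumes D_21 = 1/q; the order of k1 and k2 is irrelevant because only
    |k1 - k2| enters.
    """
    d21 = 1.0 / q
    gap = abs(k1 - k2)
    dist = []
    for g in range(gamma_max + 1):
        extra = g - gap
        if extra >= 0 and extra % 2 == 0:
            # extra // 2 mutational steps were taken without coalescing.
            dist.append(d21 * (1.0 - d21) ** (extra // 2))
        else:
            dist.append(0.0)
    return dist

dist = pi_b_distribution(k1=3, k2=1, q=5.0, gamma_max=400)
mean = sum(g * p for g, p in enumerate(dist))
# Closed form from Eq. (29): |k1 - k2| + 2 (1 - D_21) / D_21
expected_mean = 2 + 2 * (1 - 0.2) / 0.2
```

Comparing the numerical mean with the closed form of Eq. (29) is a convenient check that the truncation at `gamma_max` is large enough.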



If desired, we can now average these results over the distributions of $k_1$ and $k_2$ to get the unconditional distribution of $\pi_b$. In Fig. 5a and Fig. 5b, we compare these theoretical predictions for the overall distribution of pairwise heterozygosity with the results of full forward-time Wright-Fisher simulations, for two representative parameter combinations. We see that the distribution of heterozygosity has a nonzero peak, and that the agreement with simulations is generally good.

We emphasize that our results for $P(\pi_b)$ describe the ensemble distribution of heterozygosity. That is, if we picked a single pair of individuals from each of many independent populations, this is the distribution of $\pi_b$ one would expect to see. It is not the population distribution: if we were to pick many pairs of individuals from the same population, the $\pi_b$ of these pairs would not be independent, because much of the coalescence within individual populations occurs in rare classes that are dominated by a single lineage, for which $D_{21}$ is much higher than its average value. Thus if we measured the average $\pi_b$ within each population by taking many samples from it, the distribution of this average across populations would be different from the distribution computed above. In order to understand these within-population correlations, we now consider the genealogies of larger samples.



B. Statistics in larger samples 

FIG. 5. The distribution of pairwise heterozygosity. (a) Comparison of our theoretical predictions for the distribution of pairwise heterozygosity at positively selected sites, $\pi_b$, with the results of forward-time Wright-Fisher simulations, for N = 10^, s = 10~^, and U_b = 10~^. Simulation results are an average over 56 independent runs, with 10^ pairs of individuals sampled from each run. (b) Pairwise heterozygosity at positively selected sites for N = 10^, s = 10~^, and U_b = 10~^. (c) Comparison of our theoretical predictions for the distribution of pairwise heterozygosity at linked neutral sites, $\pi_n$, with the results of forward-time Wright-Fisher simulations, for N = 10^, s = 10~^, U_b = 10~^, and U_n = 10~^. (d) Pairwise heterozygosity at linked neutral sites for N = 10^, s = 10~^, U_b = 10~^, and U_n = 10~^.

We can compute the average and distribution of statistics describing larger samples in an analogous fashion to the pair samples. For example, consider the total number of segregating positively selected sites among a sample of 3 individuals, which we call $S_{k_1 k_2 k_3}$. These three individuals are sampled (in order) from classes $k_1$, $k_2$, and $k_3$ respectively with probability $\phi_{k_1}\phi_{k_2}\phi_{k_3}$. For three individuals sampled from the same fitness class $k$, by conditioning on the coalescence possibilities within class $k$ we find that the average total number of segregating positively selected sites is

$$\langle S_{kkk} \rangle = 0 \cdot D_{31} + D_{32}\left[ 2 + \langle \pi^b_{k,k} \rangle \right] + D_{33}\left[ 3 + \langle S_{kkk} \rangle \right], \tag{35}$$

where $\langle \pi^b_{k,k} \rangle$ is the same-class pairwise average of Eq. (28). Solving this for $\langle S_{kkk} \rangle$, we find

$$\langle S_{kkk} \rangle = \frac{2 D_{32}/D_{21} + 3 D_{33}}{1 - D_{33}}. \tag{36}$$

More generally we have

$$\langle S_{k_1 k_2 k_2} \rangle = (1 - D_{21})^{k_2 - k_1}\left[ 2(k_2 - k_1) + \langle S_{kkk} \rangle \right] + \sum_{\ell=0}^{k_2 - k_1 - 1} D_{21}(1 - D_{21})^{\ell}\left[ k_2 - k_1 + \ell + \langle \pi^b_{k,k} \rangle \right], \tag{37}$$

and even more generally we have

$$\langle S_{k_1 k_2 k_3} \rangle = k_3 - k_2 + \langle S_{k_1 k_2 k_2} \rangle. \tag{38}$$
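The chain of averages for three sampled individuals can be collected into a single function. This is a sketch under stated assumptions: it takes $\alpha = 1 - 1/q$, so that $D_{21} = 1 - \alpha$, $D_{32} = \tfrac{3}{2}\alpha(1 - \alpha)$ and $D_{33} = \alpha^2$ (values obtained from Eq. (15)), and it orders the classes so that $k_1 \le k_2 \le k_3$. The function name is hypothetical.

```python
def mean_segregating_three(k1, k2, k3, q):
    """Mean number of segregating selected sites for three sampled individuals.

    Follows Eqs. (28) and (35)-(38), assuming alpha = 1 - 1/q so that
    D_21 = 1 - alpha, D_32 = (3/2) alpha (1 - alpha) and D_33 = alpha^2.
    """
    k1, k2, k3 = sorted((k1, k2, k3))       # enforce k1 <= k2 <= k3
    alpha = 1.0 - 1.0 / q
    d21 = 1.0 - alpha
    d32 = 1.5 * alpha * (1.0 - alpha)
    d33 = alpha ** 2

    mean_pi_kk = 2.0 * (1.0 - d21) / d21                        # Eq. (28)
    mean_s_kkk = (2.0 * d32 / d21 + 3.0 * d33) / (1.0 - d33)    # Eq. (36)

    # Eq. (37): the two lineages in class k2 walk back to class k1.
    n = k2 - k1
    mean_s = (1.0 - d21) ** n * (2 * n + mean_s_kkk)
    for ell in range(n):
        mean_s += d21 * (1.0 - d21) ** ell * (n + ell + mean_pi_kk)

    # Eq. (38): the fittest lineage first descends from k3 to k2.
    return (k3 - k2) + mean_s

same_class = mean_segregating_three(0, 0, 0, q=4.0)
spread = mean_segregating_three(0, 1, 1, q=4.0)
```

When all three classes coincide the function reduces to $\langle S_{kkk} \rangle$, which is a useful limiting case to test against.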



If desired, we can average these over the distribution of $k_1$, $k_2$, and $k_3$, using the properties of differences of Gaussian random variables, as above. Alternatively, as in samples of size two, in large populations we can make the rough approximation that all sampled individuals come from the mean fitness class. Analogous calculations can be used to find the average number of segregating positively selected sites in still larger samples.

In Fig. 6 we illustrate some of these predictions (in practice samples are generated from coalescent simulations; see below) for samples of size 2, 3, and 10, and compare these to the results of forward-time Wright-Fisher simulations. We note that the agreement is generally good.

FIG. 6. Comparisons between theoretical predictions (from coalescent simulations) and forward-time Wright-Fisher simulations for the average pairwise heterozygosity and total number of segregating sites in samples of size 3 and 10 at positively selected sites and at linked neutral sites, (a) as a function of $U_b/s$ and (b) as a function of $Ns$. In both panels, N = 10^ and U_b = 10~^ while $s$ is varied. Forward-time Wright-Fisher simulation data represents an average over 56 forward simulation runs, with 10^ pairs of individuals sampled from each run. Theoretical predictions generated using backwards-time coalescent simulations represent the average of 3 x 10^ independently simulated pairs of individuals. Note that both (a) and (b) show the same data, plotted as a function of different parameters.

We can apply similar thinking to describe the distribution of the total number of segregating selected sites. First consider this distribution for a sample of size 3, all of which happen to be sampled from the same fitness class $k$, $S_{kkk}$. We have

$$P(S_{kkk} = \gamma) = D_{31}\,\delta_{\gamma,0} + D_{32}\,P(\pi^b_{k,k} = \gamma - 2) + D_{33}\,P(S_{kkk} = \gamma - 3). \tag{39}$$

We can multiply by $z^{\gamma}$ and sum over $\gamma$ to pass to generating functions, $U_3(z) = \sum_{\gamma} z^{\gamma} P(S_{kkk} = \gamma)$. This yields

$$U_3(z) = D_{31} + D_{32}\, z^2\, U_2(z) + D_{33}\, z^3\, U_3(z), \tag{40}$$

which we can solve to find

$$U_3(z) = \frac{D_{31} + z^2 D_{32}\, U_2(z)}{1 - D_{33}\, z^3}, \tag{41}$$

where we have introduced the obvious notation.

More generally, we have that the total number of segregating sites among a sample of $H$ individuals all chosen from the same fitness class $k$, which we will call $S_H$, has the distribution

$$P(S_H = \gamma) = D_{H1}\,\delta_{\gamma,0} + D_{H2}\,P(S_2 = \gamma - 2) + D_{H3}\,P(S_3 = \gamma - 3) + \ldots + D_{HH}\,P(S_H = \gamma - H). \tag{42}$$

We can again pass to generating functions, giving

$$U_H(z) = D_{H1} + D_{H2}\, z^2\, U_2(z) + D_{H3}\, z^3\, U_3(z) + \ldots + D_{HH}\, z^{H}\, U_H(z), \tag{43}$$

which we can easily solve to give

$$U_H(z) = \frac{D_{H1} + \sum_{\ell=2}^{H-1} z^{\ell} D_{H\ell}\, U_\ell(z)}{1 - D_{HH}\, z^{H}}. \tag{44}$$
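Rather than inverting the generating functions, the recursion in Eq. (42) can be iterated directly in $\gamma$, since every term on the right-hand side involves a strictly smaller number of segregating sites. The sketch below takes the merger probabilities as an input table; the numerical values shown are illustrative ones for $q = 4$ ($\alpha = 3/4$), evaluated from Eq. (15), and the function name is hypothetical.

```python
def segregating_distribution(D, H_max, gamma_max):
    """Return P[H][gamma] = P(S_H = gamma) for 2 <= H <= H_max via Eq. (42).

    D maps (H, K) to the probability that H lineages merge into K.
    """
    P = {H: [0.0] * (gamma_max + 1) for H in range(2, H_max + 1)}
    for g in range(gamma_max + 1):
        for H in range(2, H_max + 1):
            p = D[(H, 1)] if g == 0 else 0.0
            for K in range(2, H + 1):
                if g >= K:
                    # K surviving lineages each add one mutation, then recurse.
                    p += D[(H, K)] * P[K][g - K]
            P[H][g] = p
    return P

# Illustrative values for q = 4 (alpha = 3/4), evaluated from Eq. (15).
D = {(2, 1): 0.25, (2, 2): 0.75,
     (3, 1): 0.15625, (3, 2): 0.28125, (3, 3): 0.5625}
P = segregating_distribution(D, H_max=3, gamma_max=600)
mean_S3 = sum(g * p for g, p in enumerate(P[3]))
```

The mean of the resulting distribution for $H = 3$ can be compared against the closed form for $\langle S_{kkk} \rangle$ as a consistency check.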



It still remains to consider the distribution of the total number of segregating selected sites among $H$ individuals chosen at random from arbitrary fitness classes. The general case becomes quite unwieldy to compute analytically, because we must average over all fitness classes in which internal coalescence events can occur. Computing these averages for the case of a sample of size three, we find that the generating function for the distribution of the total number of segregating positively selected sites among a sample of three individuals sampled from classes $k_1$, $k_2$, and $k_3$ (ordered so that $k_1 \leq k_2 \leq k_3$, as in Eq. (38)) is given by

$$W_3(z|k_1,k_2,k_3) = z^{k_3 - k_2} \left[ z^{k_2 - k_1}\, U_2(z)\, D_{21}\, \frac{1 - (z D_{22})^{k_2 - k_1}}{1 - D_{22}\, z} + \left( D_{22}\, z^2 \right)^{k_2 - k_1} U_3(z) \right]. \tag{45}$$



Note that these distributions are all for samples each 
taken from an independently evolved population, rather 
than found from averaging many samples from each pop- 
ulation and then finding the distribution of this across 
populations. 

Analogous expressions can be computed for larger samples, but these involve ever more complex combinatorics. One may also wish to compute other statistics describing genetic variation in larger samples, such as the allele frequency spectrum. While in principle it is possible to calculate analytic expressions for any such statistic using methods similar to those described above, in practice it is easier to use our fitness-class coalescent probabilities to implement coalescent simulations, and then use these simulations to compute any quantity of interest. We describe these coalescent simulations in a later section. Alternatively, for large populations we can make use of the rough approximation that all individuals are always sampled from the mean fitness class; we explore some consequences of this approximation further in a later section.



C. Time in generations and neutral diversity 

Thus far we have focused on the fitness-class structure of genealogies and the genetic variation at positively selected sites. We now describe the correspondence between our "fitness-class coalescent" genealogy and the genealogy as measured in actual generations. Fortunately, this correspondence is extremely simple: each clonal lineage was originally created by mutations when that fitness class was at the nose of the fitness distribution. Thus if we define the current mean fitness to be class $k = 0$, the current nose class will be at approximately $k = q$, and some arbitrary class $k$ will have been created at the nose approximately $(q - k)\bar{\tau}$ generations ago. Although there is some variation in each establishment time, we neglect this variation throughout our analysis here, since it is small compared to the variation between coalescence times within clones in different classes. As we will see below, this approximation holds well in comparison to simulations in the parameter regimes we consider. This makes the correspondence between real times and step-times much simpler here than in our previous analysis of purifying selection, where the variation in real times, even given a specific fitness-class coalescent genealogy, was substantial [37].

The simple approximation of neglecting the variations in time of establishment of the fitness classes allows us to make a straightforward deterministic correspondence between the fitness-class coalescent genealogy and the coalescence times. We can then compute the expected patterns of genetic diversity at linked neutral sites: the number of neutral mutations on a genealogical branch of length $T$ generations is Poisson distributed with mean $U_n T$. From this we can compute the distribution of statistics describing neutral variation (e.g. the neutral heterozygosity $\pi_n$ or total number of neutral segregating sites in a sample $S_n$) from the corresponding statistics describing the variation at the positively selected sites. We illustrate these theoretical predictions for the distribution of neutral heterozygosity $\pi_n$ in Fig. 5c and Fig. 5d, and compare these predictions to the results of full forward-time Wright-Fisher simulations. In Fig. 6 we also show our predictions (generated using the coalescent simulations described below) for the mean number of segregating neutral sites in samples of size 2, 3, and 10, compared to the results of forward-time Wright-Fisher simulations. We note that the agreement is good across the parameter regime we consider, though there are some systematic deviations for smaller values of $U_b/s$ where our approximations are expected to be less accurate.
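The deterministic mapping from fitness classes to generations is simple enough to state in a few lines. The sketch below assumes that a clone in class $k$ was founded $(q - k)\bar{\tau}$ generations ago and, as an additional assumption of ours, that both branches leading to a sampled pair span the full time back to their common ancestor, so the expected number of pairwise neutral differences is $2 U_n T$. Function names and numerical values are illustrative.

```python
def establishment_age(k, q, tau_bar):
    """Generations since class k sat at the nose: (q - k) * tau_bar."""
    return (q - k) * tau_bar

def mean_neutral_pairwise_differences(mrca_class, q, tau_bar, u_n):
    """Expected neutral differences between two sampled individuals whose
    common ancestor founded a clone in class `mrca_class`.

    Each of the two branches has length T and carries a Poisson number of
    neutral mutations with mean u_n * T, so the mean total is 2 * u_n * T.
    """
    t_coal = establishment_age(mrca_class, q, tau_bar)
    return 2.0 * u_n * t_coal

# Reproduces the times quoted in the caption of Fig. 4 (q = 6), in units of tau_bar.
t_ab = establishment_age(0, q=6, tau_bar=1.0)      # A and B coalesce in class 0
t_mrca = establishment_age(-3, q=6, tau_bar=1.0)   # whole sample, class -3
```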



D. Time to the Most Recent Common Ancestor 

Thus far we have considered the coalescence events at 
each mutational step separately: this is necessary to de- 
scribe the full structure of genealogies. However, another 
important quantity of interest is the time to the most re- 
cent common ancestor — i.e. the coalescence time of the 
entire sample. We begin by considering this time mea- 
sured in mutational steps, and then describe how this 
relates to the coalescence time measured in generations. 

We can derive relatively simple expressions for the number of mutational steps to coalescence of an entire sample by directly calculating the probability of coalescence events over several steps at once. To do so, we note that since the dynamics at each mutational step are identical, the generating function of the number of individuals descended from a mutation at site $i$ that occurred $\ell$ mutational steps ago, $\nu_i^{(\ell)}$, is given by

$$\left\langle e^{-z \nu_i^{(\ell)}} \right\rangle = \exp\left[ -\frac{z^{\eta_\ell}}{B} \right], \tag{46}$$

where we have defined

$$\eta_\ell \equiv \left( 1 - \frac{1}{q} \right)^{\ell}. \tag{47}$$

From this expression, we can immediately compute the distribution of the number of mutational steps to coalescence of $H$ individuals sampled from the same fitness class, $J(H)$. The cumulative distribution of $J$ is given by

$$F(H,\ell) \equiv \mathrm{Prob}\left[ J(H) \leq \ell \right] = \frac{\sum_{i=1}^{B} \left( \nu_i^{(\ell)} \right)^{H}}{\left( \sum_{i=1}^{B} \nu_i^{(\ell)} \right)^{H}}. \tag{48}$$

We can compute $\langle F(H,\ell) \rangle$ using identical methods to those used to calculate the fitness-class coalescence probabilities above, and find

$$\langle F(H,\ell) \rangle = \frac{\Gamma(H - \eta_\ell)}{\Gamma(H)\,\Gamma(1 - \eta_\ell)}. \tag{49}$$

From this, we find

$$\langle J(H) \rangle = \sum_{\ell=0}^{\infty} \left( 1 - \langle F(H,\ell) \rangle \right). \tag{50}$$

Note that we could alternatively obtain expressions for $J(H)$ more directly from the fitness-class coalescence probabilities in a single step, by conditioning on the coalescence events that can happen in the first step in a similar way to that we used to compute $\langle \pi_b \rangle$ and $\langle S_{kkk} \rangle$.

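In the large-$q$ limit, the ratios of these coalescence times (measured in mutational steps) in samples of different sizes are independent of $q$:

$$\frac{\langle J(3) \rangle}{\langle J(2) \rangle} = \frac{5}{4}, \qquad \frac{\langle J(4) \rangle}{\langle J(2) \rangle} = \frac{25}{18}, \qquad \frac{\langle J(5) \rangle}{\langle J(2) \rangle} = \frac{427}{288}. \tag{51}$$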
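These ratios can be checked numerically by summing the series for $\langle J(H) \rangle$ at a large but finite $q$. The sketch below assumes $\eta_\ell = (1 - 1/q)^{\ell}$ and rewrites the ratio of gamma functions in $\langle F(H,\ell) \rangle$ as the finite product $\prod_{j=1}^{H-1} (j - \eta_\ell)/j$, which avoids evaluating $\Gamma(0)$ at $\ell = 0$. The function names, the truncation tolerance, and the choice $q = 1000$ are ours.

```python
def coalesced_fraction(H, eta):
    """<F(H, l)> = Gamma(H - eta) / (Gamma(H) Gamma(1 - eta)), written as a
    finite product so that eta = 1 (l = 0) gives exactly 0."""
    out = 1.0
    for j in range(1, H):
        out *= (j - eta) / j
    return out

def mean_steps_to_mrca(H, q, tol=1e-14):
    """<J(H)> as the sum over l >= 0 of 1 - <F(H, l)>, truncated once
    eta_l = (1 - 1/q)**l falls below `tol`."""
    alpha = 1.0 - 1.0 / q
    total, eta = 0.0, 1.0           # eta_l starts at l = 0
    while eta > tol:
        total += 1.0 - coalesced_fraction(H, eta)
        eta *= alpha
    return total

q = 1000.0
j2 = mean_steps_to_mrca(2, q)
ratios = {H: mean_steps_to_mrca(H, q) / j2 for H in (3, 4, 5)}
```

At finite $q$ the ratios differ from their limiting values by corrections of order $1/q$, so the comparison should use a tolerance of that size.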



These ratios are identical to those given by the Bolthausen-Sznitman coalescent [47], which has recently been shown to describe a number of other very different models of selection [48]. We return to this point in the Discussion. For large $H$ we find

$$\frac{\langle J(H) \rangle}{\langle J(2) \rangle} \approx \log\log H + \mathcal{O}(1). \tag{52}$$

These results suggest that there is a $q$-independent limiting process: we discuss this briefly below. We also note that the distribution of times to coalescence for large $H$ is quite different from the neutral case: the between-populations variation in $J(H)/\langle J(2) \rangle$ is only of order unity, compared to its mean of $\log\log H$. In contrast, for the neutral coalescent, the time to last common ancestor of the whole population has mean of $2\langle J(2) \rangle$ and random variations of the same order.

As with other aspects of genealogical structures, it is straightforward to convert these expressions for the coalescence times measured in mutational steps to the time in generations to the most recent common ancestor of a sample, $T_{MRCA}(H)$. Specifically, $J = \ell$ corresponds to the case where the most recent ancestor occurs $\ell$ mutational steps ago, so if the sampled individuals were from class $k$ the time to the most recent common ancestor is $[q - (k - \ell)]\bar{\tau}$ generations. We note that for a sample of two this implies that the nose-to-mean time $\tau_{nm}$ is the characteristic time scale of the coalescent, as claimed above.

Thus far we have considered the most recent common ancestor of $H$ individuals all sampled from the same fitness class $k$. However, in general we will typically sample individuals from a variety of different classes. In this case, we must sum over all possible internal coalescence events, until we reach a state where all remaining ancestral lineages are together in the same fitness class. This quickly becomes unwieldy in larger samples. In practice, it is easier to compute times to the most recent common ancestor in these cases using coalescent simulations based on our fitness-class coalescent approach, which we describe below.

As with other statistics described above, however, there is a simple approximation which is asymptotically correct for large populations: we can simply assume that all individuals are sampled from the mean fitness class. This approximation relies on the fact that most individuals sampled randomly from the population will have fitnesses close to the mean: within of order $\sqrt{v}$ of it. Thus the time differences between their establishments will typically be substantially smaller than the nose-to-mean time, $\tau_{nm}$. As this is the time scale on which typical coalescent events take place, treating all the individuals as if they were in the dominant fitness class is a reasonable rough approximation. In this approximation, the results for the times to most recent common ancestor for samples of $H$ can be simply obtained from the single-fitness class results above. We find:

$$\langle T_{MRCA}(2) \rangle \approx 2\tau_{nm}, \tag{53}$$

and in larger samples we have

$$\frac{\langle T_{MRCA}(3) \rangle}{\langle T_{MRCA}(2) \rangle} \approx \frac{9}{8}, \tag{54}$$

$$\frac{\langle T_{MRCA}(4) \rangle}{\langle T_{MRCA}(2) \rangle} \approx \frac{43}{36}, \tag{55}$$

$$\frac{\langle T_{MRCA}(5) \rangle}{\langle T_{MRCA}(2) \rangle} \approx \frac{715}{576}. \tag{56}$$

We note however that the dominant-fitness-class approximation is valid only in the limit that the lead of the population, $qs$, is much larger than the standard deviation of the fitness distribution, $\sqrt{v}$. As this ratio is of order $\sqrt{\log(Ns)}$, in practice it never becomes very large.



E. The Frequency of Individual Mutations 

An alternative way to compute many of the coalescent properties is to consider the fraction of the population with a particular mutation, which is closely related to the site frequency spectrum. The frequency of a given mutation at a particular site is determined by when that mutation occurred relative to others in its fitness class. In addition, its frequency at later times is determined by whether or not later mutations occur in its genetic background at each subsequent mutational step. Consider a mutation that occurred $\ell$ steps in the past, and define $f = \nu/\sigma$ to be the fraction of the current nose class that its descendants constitute. The probability density of $f$ is

$$\rho_\ell(f)\,df = \frac{df}{B\,\Gamma(\eta_\ell)\,\Gamma(1 - \eta_\ell)\, f^{1+\eta_\ell}\,(1 - f)^{1-\eta_\ell}}, \tag{57}$$

where as before we have defined $\eta_\ell = (1 - 1/q)^{\ell}$. Coalescent properties depend on averages of $f^{H}$. Summing over all $B$ sites and using the standard integrals of powers of $f$ and $1 - f$ expressed in terms of gamma functions, we obtain immediately the result we had found above: $\langle F(H,\ell) \rangle = \Gamma(H - \eta_\ell)/[\Gamma(H)\,\Gamma(1 - \eta_\ell)]$.
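As a consistency check, the moment integral can be written out explicitly. This sketch assumes the density of Eq. (57) has the form $\rho_\ell(f) = [B\,\Gamma(\eta_\ell)\,\Gamma(1 - \eta_\ell)]^{-1} f^{-1-\eta_\ell} (1 - f)^{\eta_\ell - 1}$ (the symbol $\rho_\ell$ is our labeling) and uses the Euler beta integral:

```latex
\begin{align}
\langle F(H,\ell) \rangle
  &= B \int_0^1 f^{H} \rho_\ell(f)\, df
   = \frac{1}{\Gamma(\eta_\ell)\,\Gamma(1-\eta_\ell)}
     \int_0^1 f^{H-1-\eta_\ell} (1-f)^{\eta_\ell - 1}\, df \\
  &= \frac{1}{\Gamma(\eta_\ell)\,\Gamma(1-\eta_\ell)} \cdot
     \frac{\Gamma(H-\eta_\ell)\,\Gamma(\eta_\ell)}{\Gamma(H)}
   = \frac{\Gamma(H-\eta_\ell)}{\Gamma(H)\,\Gamma(1-\eta_\ell)} .
\end{align}
```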

More generally, one can consider how the frequency of a mutation changes in time due to successive mutations in its lineage. If a given mutation has frequency $g$ at one time, then a time $\ell\bar{\tau}$ later (after $\ell$ further beneficial mutations have occurred) the probability density of its frequency will be:

$$\rho_\ell(f|g)\,df = df\, \frac{g(1 - g)}{\Gamma(\eta_\ell)\,\Gamma(1 - \eta_\ell)\, f(1 - f)} \left[ (1 - g)^2 \left( \frac{f}{1 - f} \right)^{\eta_\ell} + g^2 \left( \frac{1 - f}{f} \right)^{\eta_\ell} + 2g(1 - g)\cos(\pi \eta_\ell) \right]^{-1}. \tag{58}$$

From this, quantities such as the variance of the probability of $H$ individuals coalescing $\ell$ steps in the past and hence the variances in the coalescent times of $H$ individuals can be computed.

In the limit of large $q$ the exponent $\eta$ that parameterizes the time difference, $t = \ell\bar{\tau}$, is simply $\eta \approx e^{-t/\tau_{nm}}$. This is independent of $q$: only the "nose-to-mean" time that it takes for the new mutants to dominate the population matters. In this limit, a single mutational step occurs in a time that is a very small fraction, $\epsilon = 1/q$, of the nose-to-mean time $\tau_{nm}$. The conditional probability of going from $g$ to $f$ in this step is

$$\rho_1(f|g)\,df = \frac{g(1 - g)\,\epsilon\, df}{(f - g)^2 + \pi^2 \epsilon^2 [g(1 - g)]^2}. \tag{59}$$

Eq. (59) is an approximate delta-function in $f - g$, as one would expect in the limit of a small time step. But it also corresponds to a probability per unit time of a jump from $g$ to $f$ of $\frac{1}{\tau_{nm}}\, df\, g(1 - g)/(f - g)^2$. Specifically it describes the genetic background either containing the mutation (frequency $g$) or not containing the mutation (frequency $1 - g$) increasing in size by a factor between $1 + h$ and $1 + h + dh$ with rate $\frac{1}{\tau_{nm}}\, dh/h^2$ (with $\epsilon$ providing a small $h$ cutoff). This corresponds to a continuous time birth process in a sub-population of (large) size $n$ with rate per individual to give birth to $k$ offspring, $\frac{1}{\tau_{nm} k^2}$. These considerations provide an alternative way to compute coalescent statistics.



F. Coalescent Simulations

We can use the fitness-class coalescence probabilities in Eq. (15) to implement an algorithm for coalescent simulations along the lines of Gordo et al. [39], using the structured coalescent framework of Hudson and Kaplan [38]. Specifically, to describe the diversity in a sample of $n$ individuals, we first randomly sample their fitness classes independently from the distribution $\phi_k$. We then start with the individual in the most-fit class, and trace back its ancestry as it steps through successive classes within the fitness distribution. When that individual enters a class with other individuals, we use Eq. (15) to determine the probabilities of all possible coalescence events in that class. We then continue to trace back the ancestry of the sample further through the distribution, allowing for coalescence events at each step according to the appropriate probabilities. We continue this procedure until all individuals have coalesced.

This simple coalescent algorithm produces a fitness- 
class coalescent tree drawn from the appropriate proba- 
bility distribution of genealogies. We can then compute 
any statistic of interest describing this genealogy. By 
repeating this algorithm, we can obtain the probability 
distribution of the statistic. In practice this is a highly ef- 
ficient procedure, since the coalescent simulations are ex- 
tremely fast and the computational time required scales 
only with the size of the sample rather than the size of 
the population. 
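The backward-time procedure can be sketched in a few dozen lines. The version below is a simplified illustration rather than the authors' implementation: it tracks only the number of ancestral lineages (drawing the number of survivors in each class from the summed probabilities $D_{HK}$ instead of the full configurations of Eq. (15)), it takes the sampled classes as an input rather than drawing them from $\phi_k$, and it assumes $\alpha = 1 - 1/q$. All function names are hypothetical.

```python
import random

def merger_probs(H, alpha):
    """[D_H1, ..., D_HH], with D_HK = H / (K alpha) times the coefficient of
    z^H in [1 - (1 - z)^alpha]^K (normalization implied by Eq. (15))."""
    a = [0.0] * (H + 1)
    a[1] = alpha
    for h in range(2, H + 1):
        a[h] = a[h - 1] * (h - 1 - alpha) / h
    probs, power = [], [1.0] + [0.0] * H
    for K in range(1, H + 1):
        new = [0.0] * (H + 1)
        for i, coeff in enumerate(power):
            for j in range(1, H + 1 - i):
                new[i + j] += coeff * a[j]
        power = new
        probs.append(H * power[H] / (K * alpha))
    return probs

def simulate_mrca_class(sample_classes, q, rng):
    """Trace a sample backwards through fitness classes and return the class
    in which the whole sample finds its most recent common ancestor."""
    alpha = 1.0 - 1.0 / q
    remaining = sorted(sample_classes, reverse=True)
    k = remaining[0]            # start in the most-fit sampled class
    lineages = 0
    while True:
        # Individuals sampled from class k join lineages arriving from above.
        while remaining and remaining[0] == k:
            remaining.pop(0)
            lineages += 1
        if lineages > 1:
            probs = merger_probs(lineages, alpha)
            lineages = rng.choices(range(1, lineages + 1), weights=probs)[0]
        if lineages == 1 and not remaining:
            return k
        k -= 1                  # every surviving lineage steps back one class

rng = random.Random(2024)
mrca_classes = [simulate_mrca_class([0, 0, 1, -1], q=5.0, rng=rng)
                for _ in range(1000)]
```

From the returned class one can recover a coalescence time using the deterministic class-to-generation mapping, and repeating the simulation builds up the distribution of any statistic of interest. A fuller implementation would record which lineages merge, which is where the configuration-level probabilities become necessary.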



G. Comparison to Simulations 

Our coalescent simulations represent an algorithmic 
implementation of our fitness-class coalescent, using all 
of the analytical expressions for the sampling and coales- 
cence probabilities described above. Thus these coales- 
cent simulations rely on all of the approximations under- 
lying our method. To test the validity of these approxi- 
mations and the accuracy of our fitness-class coalescent 
method, we compared the predictions of these coalescent 
simulations to full forward-time Wright-Fisher simulations of our model. These comparisons are illustrated in Fig. 5 and Fig. 6 and in Table I.

Our Wright-Fisher simulations were implemented assuming a population of constant size $N$, in which each generation consisted of a mutation and a selection step. In the mutation step, we independently choose the number of beneficial and neutral mutations within each extant genotype from the appropriate multinomial distribution. Each new mutation was assigned a unique index and all unique genotypes were tracked. In the selection step, we sample $N$ individuals with replacement from the previous generation, using a multinomial sampling weight adjusted for selective differences between individuals relative to the population mean fitness [50].

U_b/s     Ns         D_10 theory    D_10 simulations
0.2000    5000       -3.3199        -3.3378
0.1000    10000      -3.3489        -3.3569
0.0200    50000      -3.3533        -3.3322
0.0100    100000     -3.3571        -3.4188
0.0020    500000     -3.3665        -3.3024
0.0010    1000000    -3.3717        -3.3670

TABLE I. Comparisons between theoretical predictions (from coalescent simulations) and forward-time Wright-Fisher simulations for Tajima's $D$ [49] in a sample of size 10, $D_{10}$. Here U_b = 10~^ and N = 10^ while $s$ is varied. Theoretical predictions are obtained by sampling 10^ backward coalescent simulations. Forward-time simulation results are an average over 56 forward simulation runs, with 10^ samples of n = 2 and n = 10 individuals.



V. DISCUSSION 

We have developed a fitness-class coalescent method 
to calculate how positive selection on many linked sites 
alters the structure of genealogies. This has allowed us 
to calculate how clonal interference shapes the patterns 
of genetic diversity in rapidly adapting populations. Our 
approach moves away from the traditional method of cal- 
culating the structure of genealogies in real time. Rather, 
we treat each mutational step from one fitness class to 
the next as an "effective generation," and trace how a 
sample of individuals descended by mutations through 
these fitness classes. In each "effective generation" we 
calculated the total probability of all possible coalescence 
events, Eq. (15). This allows us to calculate the struc- 
ture of genealogies in this "fitness-class space," which 
directly corresponds to the genetic diversity at positively 
selected sites. We then converted this fitness-class coa- 
lescent to the genealogy in real time in order to calculate 
the expected patterns of neutral diversity. 

We have shown that we can use this approach to com- 
pute analytic expressions for the distributions of several 
simple statistics describing patterns of molecular evolu- 
tion. However, it is often easiest to compute expected 
patterns of variation using backwards-time coalescent 
simulations which explicitly implement the fitness-class 
coalescent algorithm using the distribution of the frac- 
tion of the population in each fitness class (j)k and the 
coalescence probabilities in Eq. ( 15 ) to simulate genealo- 
gies. These coalescent simulations are extremely efficient, 
and in practice it is usually faster to run millions of 
these backwards-time simulations than it is to numeri- 
cally evaluate the sums over fitness classes involved in 
the corresponding exact analytic expressions. These coalescent
simulations also have the advantage of being very
similar in spirit to structured coalescent simulations that
describe the effects of purifying selection (see e.g. Gordo
et al. [39] and Seger et al. [40]), so they can in principle be
used for parameter estimation and inference in analogous
ways.
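
To make the merger step of such a simulation concrete, the following is a minimal sketch, not the implementation used for the results in this paper. It assumes that all lineages under consideration sit in the same fitness class, and it samples the mergers of one effective generation with a sequential (Chinese-restaurant-type) rule with discount $\alpha = 1 - 1/q$: the next lineage joins an existing ancestral lineage that already holds $h_j$ sampled lineages with weight $h_j - \alpha$, and founds a new ancestral lineage with weight $K\alpha$. Under this assumption the probability of ending with $K$ ancestral lineages agrees with Eq. (15) summed over configurations. The function names are hypothetical, and the sampling of fitness classes from $\phi_k$ is omitted.

```python
import random


def sample_merger(num_lineages, alpha, rng):
    """Sample how lineages in one fitness class merge in one effective generation.

    Returns the list of block sizes h_j (one entry per ancestral lineage).
    Assumption: sequential rule with discount alpha, as described in the lead-in.
    """
    block_sizes = []
    for n in range(num_lineages):
        if n == 0:
            block_sizes.append(1)
            continue
        k = len(block_sizes)
        # Total weight is k*alpha + sum(h_j - alpha) = n.
        u = rng.random() * n
        if u < k * alpha:
            block_sizes.append(1)  # founds a new ancestral lineage
            continue
        u -= k * alpha
        for j, h in enumerate(block_sizes):
            u -= h - alpha
            if u < 0:
                block_sizes[j] += 1  # merges into ancestral lineage j
                break
        else:
            block_sizes[-1] += 1  # guard against floating-point round-off
    return block_sizes


def steps_to_mrca(sample_size, q, rng):
    """Count effective generations (mutational steps) until one lineage remains."""
    alpha = 1.0 - 1.0 / q
    lineages, steps = sample_size, 0
    while lineages > 1:
        lineages = len(sample_merger(lineages, alpha, rng))
        steps += 1
    return steps
```

For two lineages this sketch merges the pair with probability $1 - \alpha = 1/q$ per step, so the mean number of steps to their common ancestor is q.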

Our analysis throughout this paper is very similar in 
spirit to the fitness-class coalescent method we previously 
used to describe how purifying selection at many linked 
sites alters the structure of genealogies and patterns of 
molecular evolution [37, 46]. However, there are two important
technical differences. First, in the case of purifying
selection, fluctuations in the frequencies of each
fitness class due to genetic drift can be substantial in 
certain parameter regimes. These fluctuations are partic- 
ularly important near the nose of the distribution, where 
they can lead to effects such as Muller's ratchet. Al- 
though individuals are unlikely to be sampled from this 
nose, they are very likely to coalesce there. Neglecting 
these fluctuations was therefore an important approxima- 
tion that substantially restricted the regime of validity of 
our analysis. By contrast, in the case of positive selection, 
fluctuations in the sizes of each fitness class are negligi- 
ble (except at the nose) across a broad range of relevant 
parameter values. Furthermore, fluctuations at the nose 
are much less important for patterns of diversity than in 
the case of purifying selection, because individuals are 
unlikely to either be sampled there or to coalesce there. 
This reflects a fundamental difference between the neu- 
tral and purifying selection processes and the rapid adap- 
tation dynamics analyzed here. For the former, genetic 
drift plays a key role in driving the fluctuations, while for 
the latter, genetic drift is almost irrelevant: the fluctua- 
tions are dominated by the stochasticity in the timings 
of the beneficial mutations that occur near the nose of 
the fitness distribution. 

A second key simplification of our analysis of positive 
selection, compared to the purifying selection case, is 
that the clonal structure of each fitness class becomes 
effectively "frozen" once that class is no longer at the 
nose of the fitness distribution. This means that coalescence
probabilities are identical in all fitness classes,
which stands in contrast to the case of purifying selection,
where the clonal structure within all classes is constantly
changing. This also avoids the need to carefully
analyze the timing and order of mutation events in the 
history of a sample and simplifies the mapping between 
our fitness-class coalescent genealogy and the genealogy 
measured in real time. 

Our results demonstrate how positive selection on 
many linked sites distorts the structure of genealogies 
away from neutral expectations. We show several examples
of these selected genealogies, for various
parameter values, in Fig. 7. The most striking qualitative
conclusion of our analysis is that multiple merger
events, where several ancestral lineages coalesce into one
in a single effective generation, occur with comparable
probabilities to pairwise coalescence events. We note that 
these events are multiple mergers within a single effective 
generation in our fitness-class coalescent, and hence are 
not actually multiple mergers within a single real gener- 
ation. However, these events happen very close together 
in real time compared to the other relevant timescales, 
so they will appear as effectively instantaneous. This 
leads to a more "starlike" shape of genealogical trees. 
This signature is characteristic of the action of positive 
selection; our analysis here illustrates how starlike we ex- 
pect genealogies to be (and how many deeper coalescence 
events are preserved) given the interplay between inter- 
ference and hitchhiking effects characteristic of this rapid 
adaptation regime. It may prove useful in future work to 
analyze this specific situation in the context of more gen- 
eral models of the coalescent with multiple mergers [51]. 

We note that the characteristic time scale of the coalescence
is the "nose-to-mean" time, $\tau_{nm}$, which is the
time it takes for the collection of new mutants at the nose
to dominate the population. In units of this time,
trees for different values of q become statistically similar
for large q. One striking feature, which occurs roughly
once each $\tau_{nm}$, is the coalescence of a substantial fraction
of all the (remaining) lineages at a single time step:
this is caused by one new beneficial mutation occurring 
so much earlier than typical that its descendants repre- 
sent a substantial fraction of the population in the nose. 
Examples of this can be seen in Fig. 7. Another perhaps- 
surprising feature of the genealogies in large samples is 
that some aspects are less variable from one population 
to another than neutral coalescent trees, while other as- 
pects are more variable. In the recent past, for times 
much shorter than the mean coalescence time of pairs of 
individuals, neutral coalescent trees tend to be rather
similar, while the multiple-coalescence events that char- 
acterize the positively selected genealogies cause larger 
variations between populations. In contrast, the time to 
last common ancestor of large samples is broadly dis- 
tributed for neutral trees but narrowly distributed (at 
least asymptotically) for positively selected trees. 

Because individuals are unlikely to be sampled from 
near the nose of the distribution, the initial coalescence 
events in the history of the sample are typically in the 
bulk of the fitness distribution. Since these coalescence 
events happened well in the past when these classes were 
at the nose of the distribution, the terminal branches in 
the genealogies of a sample are likely to be longer com- 
pared to internal branches than we would expect under 
neutrality. In other words, recent branches of genealo- 
gies are longer relative to more ancient branches. This 
effect is qualitatively similar to the situation in which ef- 
fective population size declines as time recedes into the 
past: this has long been recognized as a general signature 
of the effects of both purifying and positive selection. It 
leads to an excess of singleton mutations in the site frequency
spectrum, and the negative values of Tajima's
D that we have observed. However, clonal interference 
mitigates these effects relative to a hard selective sweep. 
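
As an illustration of why an excess of singletons drives Tajima's D negative, the sketch below evaluates the standard Tajima (1989) statistic from an unfolded site frequency spectrum. It is a generic textbook calculation, not the code used for the results reported here; the function name and the convention that `sfs[i-1]` counts sites at which the derived allele appears in i of the n sampled individuals are our assumptions.

```python
import math


def tajimas_d(sfs):
    """Tajima's D from an unfolded site frequency spectrum.

    sfs[i-1] is the number of sites whose derived allele is carried by
    i of the n sampled individuals, for i = 1 .. n-1.
    """
    n = len(sfs) + 1
    seg_sites = sum(sfs)
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    # Mean pairwise diversity: a site at frequency i/n separates i*(n-i) pairs.
    pairs = n * (n - 1) / 2.0
    pi = sum(count * i * (n - i) for i, count in enumerate(sfs, start=1)) / pairs
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (pi - seg_sites / a1) / math.sqrt(variance)
```

A spectrum made up only of singletons, as expected for a perfectly starlike genealogy, gives a negative value, while a spectrum proportional to 1/i (the neutral expectation) gives zero.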

Our results also demonstrate that even when beneficial
mutations are rare compared to neutral mutations,
$U_b \ll U_n$, positively selected sites can still contribute
a significant fraction of the total genetic variation observed
in a population. For example, in a sample of two
individuals the total heterozygosity at positively selected
sites, $\pi_b$, will typically be several times q. The typical neutral
heterozygosity, on the other hand, will be of order
$\pi_n \sim U_n \tau_{nm}$. Thus even when $U_n \gg U_b$, $\pi_b$ will often be
comparable to or even greater than $\pi_n$. This is consistent
with the general observation in microbial evolution
experiments that a substantial fraction of observed mutations
are beneficial [13, 21-23]. The fact that positively
selected sites can be a significant fraction of the polymor- 
phisms emphasizes the importance of understanding the 
patterns of diversity at these sites, which have distinct 
patterns compared to linked neutral variation and hence 
may provide important signatures in sequence data of 
adaptation that involves clonal interference. 

Our predictions for the structure of the fitness-class ge- 
nealogies depend on the population size, mutation rate, 
and strength of selection only through the combinations 
$\log[Ns]$ and $\log[U_b/s]$. The timescales in generations are
also proportional to the inverse of the strength of selection.
Thus the patterns of genetic variation in an
adapting population depend only very weakly (logarithmically)
on population size and mutation rate in the
large-q regime where clonal interference is pervasive, suggesting
that there is limited power to infer these parameters
from patterns of molecular evolution. This is a consequence
of the fact that the evolutionary dynamics are
also only very weakly dependent on these parameters in
the clonal interference regime.
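
The weakness of this logarithmic dependence can be illustrated with a short calculation. The sketch below assumes the leading-order expression $q \approx 2\log[Ns]/\log[s/U_b]$ from earlier work on this model; that expression is not rederived in this section, and the parameter values in the usage note are purely hypothetical.

```python
import math


def lead_q(N, U_b, s):
    """Leading-order q = 2 log[N s] / log[s / U_b].

    Assumption: this form is taken from earlier analyses of the same model
    and is only meant to illustrate the logarithmic parameter dependence.
    """
    return 2.0 * math.log(N * s) / math.log(s / U_b)
```

For example, with the hypothetical values $U_b = 10^{-5}$ and $s = 10^{-2}$, raising N tenfold from $10^7$ to $10^8$ changes q only from about 3.3 to 4.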

FIG. 7. Examples of fitness-class coalescent genealogies in samples of size 50 from forward-time Wright-Fisher simulations. The
tips of each tree correspond to individuals sampled from the present. Each tip is placed horizontally according to the fitness
class from which that individual was sampled (classes are numbered according to the number of beneficial mutations relative to
the most recent common ancestor of the sample). Coalescence events are depicted according to the fitness class in which they
occurred. Each unit of time on the horizontal axis corresponds to one beneficial mutation, so that two individuals separated
by a branch length of $\ell$ have $\pi_b = \ell$. These fitness-class genealogies can be converted to genealogies in real time by using our
approximation that all coalescent events happen when the relevant class was at the nose of the fitness distribution. Note that
the characteristic time for coalescence is the time it takes for q successive beneficial mutations: this varies considerably with
the parameters used. In all trees, N = 10^ and Ub = 10~^. (a) An example of a genealogical tree for s = 10~^. (b) An example
of a tree for s = 5 x 10~^. (c) An example of a tree for s = 10~^. (d) An example of a tree for s = 5 x 10~^.

We have seen that in the large-q limit of our model,
the ratios of the number of mutational steps to the most
recent common ancestors in samples of different sizes are
exactly equivalent to those expected in the Bolthausen-
Sznitman coalescent [47]. This is identical to the limiting
behavior of these ratios in several very different models of
selection recently studied by Brunet, Derrida, and others
[52-56]; see [48] for a recent review. The reason for this
equivalence between very different models remains unclear,
but suggests a degree of universality: an interesting
topic for future work. We emphasize, however, that the
times to most recent ancestors in our model reduce to the
Bolthausen-Sznitman ratios only when measured in mutational
steps and only when all individuals are sampled
from the same fitness class. The ratios of time to most
recent common ancestors, measured in generations, have
a different form. Nevertheless, in the limit of very large
q, almost all the individuals will have fitness much closer
to the mean than to the nose. As the rate of coalescence
is proportional to the difference between the mean and
the nose, the approximation of sampling only from the
largest fitness class is asymptotically good. The modifications
of the Bolthausen-Sznitman ratios are then simply
determined by adding the nose-to-mean time (which
turns out to be equal to the mean pairwise correlation
time) to all the coalescent times.
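
For reference, the Bolthausen-Sznitman ratios themselves are easy to evaluate. The sketch below assumes the standard rates of that coalescent, in which each particular set of k out of b lineages merges at rate $(k-2)!\,(b-k)!/(b-1)!$, with time measured in units of the mean pairwise coalescence time. It then solves the first-step recursion for the expected time to the most recent common ancestor exactly, using rational arithmetic. The function names are hypothetical.

```python
from fractions import Fraction
from math import comb, factorial


def bs_merger_rate(b, k):
    """Rate at which one particular set of k out of b lineages merges."""
    return Fraction(factorial(k - 2) * factorial(b - k), factorial(b - 1))


def expected_tmrca(n):
    """Expected time to the most recent common ancestor of n lineages."""
    expected = {1: Fraction(0)}
    for b in range(2, n + 1):
        total_rate = Fraction(0)
        weighted_future = Fraction(0)
        for k in range(2, b + 1):
            # comb(b, k) distinct sets of k lineages can merge, leaving b-k+1.
            rate = comb(b, k) * bs_merger_rate(b, k)
            total_rate += rate
            weighted_future += rate * expected[b - k + 1]
        # Wait 1/total_rate for the next event, then continue from the new state.
        expected[b] = (1 + weighted_future) / total_rate
    return expected[n]
```

With these rates the ratios of expected times for samples of three and four individuals to that of a pair come out as 5/4 and 25/18.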

Our analysis in this paper has focused on the simplest 
possible model of positive selection on a large number of 
linked sites, and we have neglected many potential com- 
plications. For example, we have assumed that epistatic 
interactions between mutations can be neglected, and 
that the total potential supply of beneficial mutations
is not significantly depleted over the course of adaptation.
This is consistent with our focus on rapidly adapting
populations in the large-q clonal interference regime.
As a population approaches a fitness peak, these approx- 
imations will likely fail and the dynamics of adaptation 
and patterns of genetic variation may either become more 
complex, or return to the regime where further adapta- 
tion is driven by isolated selective sweeps. We have also 
focused exclusively on beneficial mutations which all have 
the same fitness effect s, and have neglected both dele- 
terious mutations and beneficial mutations which confer 
different fitness effects. This is justified by earlier work
by us and others that suggests that in rapidly adapting
populations, clonal interference ensures that evolution is
dominated by beneficial mutations that confer a specific
fitness advantage [13, 42, 43]. However, we have recently
analyzed the evolutionary dynamics within a population
in a model which explicitly allows for a distribution of
fitness effects of beneficial mutations [13]. We and others
have also analyzed the case where a mix of both beneficial
and deleterious mutations is possible [10, 43, 57].
Those works describe the variation in fitness within pop- 
ulations in these more complex models and hence could 
form the basis for a more complex version of the fitness- 
class coalescent method we have used here. This gener- 
alized fitness-class coalescent would admit the possibility 
of mutational steps of various different sizes and towards 
both lower and higher fitness. 

An alternative approach by one of us allows for beneficial
mutations to have a variety of different effects, without
making reference to fitness classes [45]. As long as
the distribution of fitness effects of potential beneficial
mutations falls off faster than a simple exponential for
large s, the dynamics in large populations is dominated
by mutations with s close to some value, $\tilde{s}$ [13, 45]. In
this case, most properties of the dynamics on time scales
longer than the nose-to-mean time $\tau_{nm}$ are quite universal
(and more strongly so when $v/\tilde{s}^2$ is large). As $\tau_{nm}$ is
also the time scale of the coalescence, this suggests that
the coalescent statistics should also be universal. The
continuous time results quoted above for the evolution of
the frequency of a sub-population emerge naturally in
this more general analysis, and indeed correspond to the
universal limit of asymptotically-large populations [45].
In the alternative regime where the distribution of fitness
effects of potential beneficial mutations falls off more
slowly than exponentially, mutations can jump from the
bulk of the distribution to the lead. These play an important
role in the dynamics, and cause q to remain small
even for asymptotically large populations [11]. The behavior
is then less universal, but this situation is likely
to be relevant in real populations, especially in the initial
stages of adaptation to a new environment. The effects
of the distribution of fitness effects of
beneficial mutations, of initial transient dynamics, and
of large numbers of deleterious mutations are interesting
topics for future research.

The final simplification of our analysis is its focus on 
purely asexual populations: we have neglected the ef- 
fects of recombination. Thus our results are primarily 
applicable to interpreting the patterns of genetic varia- 
tion in asexual microbial evolution experiments, though 
they may also be relevant to sexual organisms on short 
genomic distance scales within which recombination is 
rare on the relevant timescales. We note however that 
our results provide an essential ingredient for predicting 
the effects of infrequent recombination on the evolution- 
ary dynamics. Specifically, we can use our predictions
for the genetic variation between a pair of individuals
sampled from the population to predict the distribution 
of fitnesses of recombinant offspring resulting from sex 
between these individuals. This in turn determines how 
rare recombination alters the evolutionary dynamics and 
the distribution of fitnesses within the population. It 
may prove possible to then in turn calculate how these 
shifts in evolutionary dynamics alter the patterns of ge- 
netic diversity in the population. These extensions of 
our approach to analyze the effects of recombination on 
both evolutionary dynamics and patterns of molecular 
evolution are an important direction for future research.

APPENDIX A: COALESCENCE PROBABILITIES

In this Appendix, we carry out the calculations of coalescence
probabilities in detail. Consider $H$ individuals
who coalesce into $K$ lineages, with $h_1$ individuals coalescing
into lineage 1, $h_2$ individuals coalescing into lineage
2, and so on, up to $h_K$ individuals coalescing into lineage
$K$. We note that $\sum_{j=1}^{K} h_j = H$. We begin by asking for the
probability that $H$ individuals coalesce into $K$ lineages
at a specific set of $K$ sites (out of the total of $B$) in the
genome: call these sites 1 through $K$ in the genome, for
concreteness. We also assume for now that the $H$ individuals
coalesce in a specific way into these $K$ lineages
(i.e. individual 3 coalesces into the lineage at site 5, etc.).
We denote the frequency of the lineage at site $j$ in the
genome by $f_j$, so that $f_j = \nu_j/\sigma$. We denote by $A$ the
probability that the $H$ individuals coalesce into the $K$
lineages at these specific sites according to the specific
configuration $\{h_j\}$.

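Given these definitions, we have:

$$A = \left\langle \prod_{j=1}^{K} f_j^{h_j} \right\rangle = \left\langle \prod_{j=1}^{K} \frac{\nu_j^{h_j}}{\sigma^{h_j}} \right\rangle = \left\langle \frac{1}{\sigma^{H}} \prod_{j=1}^{K} \nu_j^{h_j} \right\rangle .$$

We make use of the identity

$$\frac{1}{\sigma^{H}} = \frac{1}{\Gamma(H)} \int_0^\infty x^{H-1} e^{-x\sigma} \, dx$$

to obtain

$$A = \left\langle \frac{1}{\Gamma(H)} \int_0^\infty x^{H-1} e^{-x\sigma} \prod_{j=1}^{K} \nu_j^{h_j} \, dx \right\rangle . \tag{62}$$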



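We now use the definition of $\sigma$ as the sum of the $\nu_j$ and
separate out the $\nu_j$ that correspond to the lineages we
are considering. Note that the $\nu_j$ are independent of
each other. Thus one obtains

$$A = \int_0^\infty \frac{x^{H-1}}{\Gamma(H)} \left\langle \left( \prod_{j=1}^{K} \nu_j^{h_j} e^{-x\nu_j} \right) \left( \prod_{j=K+1}^{B} e^{-x\nu_j} \right) \right\rangle dx , \tag{63}$$

whence, by independence,

$$A = \int_0^\infty \frac{x^{H-1}}{\Gamma(H)} \left( \prod_{j=1}^{K} \left\langle \nu_j^{h_j} e^{-x\nu_j} \right\rangle \right) \left\langle e^{-x\nu} \right\rangle^{B-K} dx . \tag{64}$$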



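From Eq. (10) we have

$$\left\langle e^{-z\nu} \right\rangle = e^{-z^{1-1/q}/B} = e^{-z^{\alpha}/B} , \tag{65}$$

where $\alpha = 1 - 1/q$. Substituting this in, and assuming
large $B$ so that $(B - K)/B \approx 1$, we find

$$A = \int_0^\infty \frac{x^{H-1}}{\Gamma(H)} \, e^{-x^{\alpha}} \left( \prod_{j=1}^{K} \left\langle \nu_j^{h_j} e^{-x\nu_j} \right\rangle \right) dx . \tag{66}$$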



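We then use that

$$\left\langle \nu^{h} e^{-z\nu} \right\rangle = (-1)^{h} \frac{\partial^{h}}{\partial z^{h}} e^{-z^{\alpha}/B} . \tag{67}$$

Making the large-$B$ approximation that $e^{-z^{\alpha}/B} \approx 1 - z^{\alpha}/B$
and differentiating, we find

$$\left\langle \nu^{h} e^{-z\nu} \right\rangle = \frac{\alpha \, \Gamma(h - \alpha)}{B \, \Gamma(1 - \alpha)} \, z^{\alpha - h} . \tag{68}$$

Using this result, we have

$$A = \int_0^\infty \frac{x^{H-1}}{\Gamma(H)} \, e^{-x^{\alpha}} \prod_{j=1}^{K} \left[ \frac{\alpha \, \Gamma(h_j - \alpha)}{B \, \Gamma(1 - \alpha)} \, x^{\alpha - h_j} \right] dx . \tag{69}$$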
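
The differentiation step behind Eq. (68) can be checked numerically. Differentiating $-z^{\alpha}/B$ a total of h times and multiplying by $(-1)^h$ gives the coefficient $(-1)^{h+1} \alpha(\alpha-1)\cdots(\alpha-h+1)$ of $z^{\alpha-h}/B$, which should equal $\alpha\,\Gamma(h-\alpha)/\Gamma(1-\alpha)$. The sketch below compares the two forms; the function names are hypothetical and the check is only a sanity test of the algebra, not part of the original derivation.

```python
import math


def derivative_form(h, alpha):
    """(-1)^(h+1) * alpha * (alpha - 1) * ... * (alpha - h + 1)."""
    falling = 1.0
    for i in range(h):
        falling *= alpha - i
    return (-1) ** (h + 1) * falling


def gamma_form(h, alpha):
    """alpha * Gamma(h - alpha) / Gamma(1 - alpha), for h >= 1 and 0 < alpha < 1."""
    return alpha * math.gamma(h - alpha) / math.gamma(1.0 - alpha)
```

Both forms are positive for every h of at least 1, as they must be for the average of a positive quantity.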



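Since $\sum_{j=1}^{K} h_j = H$ we can rewrite this as

$$A = \int_0^\infty \frac{x^{K\alpha}}{x} \, e^{-x^{\alpha}} dx \; \frac{\alpha^{K}}{B^{K} \Gamma(H)} \prod_{j=1}^{K} \frac{\Gamma(h_j - \alpha)}{\Gamma(1 - \alpha)} . \tag{70}$$

Now we define

$$y = x^{\alpha}, \qquad dy = \alpha x^{\alpha - 1} dx, \qquad \frac{dy}{\alpha y} = \frac{dx}{x} , \tag{71}$$

and making this change of variables obtain

$$A = \frac{\alpha^{K-1}}{B^{K} \Gamma(H)} \int_0^\infty y^{K-1} e^{-y} \, dy \prod_{j=1}^{K} \frac{\Gamma(h_j - \alpha)}{\Gamma(1 - \alpha)} . \tag{72}$$

The $dy$ integral yields a $\Gamma$ function, giving

$$A = \frac{\alpha^{K-1} \, \Gamma(K)}{B^{K} \Gamma(H)} \prod_{j=1}^{K} \frac{\Gamma(h_j - \alpha)}{\Gamma(1 - \alpha)} . \tag{73}$$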
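
The change of variables in Eqs. (71)-(73) amounts to the statement that $\int_0^\infty x^{K\alpha - 1} e^{-x^{\alpha}} dx = \Gamma(K)/\alpha$. As a rough numerical check, the sketch below evaluates the left-hand side by the trapezoid rule after substituting $x = e^{t}$, which removes the integrable singularity at the origin. The function name, integration limits and step count are our own choices.

```python
import math


def radial_integral(K, alpha, t_min=-80.0, t_max=40.0, steps=20_000):
    """Trapezoid estimate of the integral of x^(K*alpha - 1) * exp(-x^alpha) over x > 0.

    With x = exp(t) the integrand becomes exp(K*alpha*t - exp(alpha*t)) dt,
    which decays rapidly in both directions for alpha of order one.
    """
    dt = (t_max - t_min) / steps
    total = 0.0
    for i in range(steps + 1):
        t = t_min + i * dt
        value = math.exp(K * alpha * t - math.exp(alpha * t))
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * value
    return total * dt
```

The default limits are adequate for values of $\alpha$ that are not too small; for very small $\alpha$ the lower limit would need to be extended.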



So far we have considered the probability of this coalescence
event involving $K$ lineages at a specific set of $K$
sites on the genome. We now want to sum over all the
possible sets of $K$ sites on the genome at which this could
occur. There are a total of $B^{K}/K!$ of these. We define $E$
to be the probability of this coalescence event involving
$K$ lineages at any set of $K$ sites on the genome. We have

$$E = \frac{\alpha^{K-1}}{K \, \Gamma(H)} \prod_{j=1}^{K} \frac{\Gamma(h_j - \alpha)}{\Gamma(1 - \alpha)} . \tag{74}$$



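Now so far we have assumed that specific individuals
coalesce into specific lineages. But given a set $\{h_j\}$ there
are a total of $H!/\prod_{j=1}^{K} h_j!$ ways to assign specific individuals
to specific lineages. Thus the total probability
of $H$ individuals coalescing into $K$ lineages, in a specific
configuration $\{h_j\}$, which we will call $C_{H,K,\{h_j\}}$, is

$$\begin{aligned} C_{H,K,\{h_j\}} &= \frac{H!}{\prod_{j=1}^{K} h_j!} \, \frac{\alpha^{K-1}}{K \, \Gamma(H)} \prod_{j=1}^{K} \frac{\Gamma(h_j - \alpha)}{\Gamma(1 - \alpha)} \\ &= \frac{H \alpha^{K-1}}{K} \prod_{j=1}^{K} \frac{\Gamma(h_j - \alpha)}{\Gamma(h_j + 1) \, \Gamma(1 - \alpha)} , \end{aligned} \tag{75}$$

equivalent to Eq. (15) in the main text.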
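
As a consistency check on this result, the sketch below evaluates $C_{H,K,\{h_j\}} = (H\alpha^{K-1}/K) \prod_j \Gamma(h_j - \alpha)/[\Gamma(h_j+1)\Gamma(1-\alpha)]$ with log-gamma functions and sums it over all ordered configurations $(h_1, \ldots, h_K)$ of positive integers adding up to $H$. It assumes that this sum over ordered configurations is the total probability of ending with $K$ lineages, in which case the totals over $K = 1, \ldots, H$ should add up to one. The function names are hypothetical.

```python
import math


def log_gamma_ratio(h, alpha):
    """log of Gamma(h - alpha) / (Gamma(h + 1) * Gamma(1 - alpha))."""
    return math.lgamma(h - alpha) - math.lgamma(h + 1) - math.lgamma(1.0 - alpha)


def coalescence_probability(block_sizes, alpha):
    """C_{H,K,{h_j}} for one ordered configuration of block sizes."""
    H = sum(block_sizes)
    K = len(block_sizes)
    log_product = sum(log_gamma_ratio(h, alpha) for h in block_sizes)
    return H * alpha ** (K - 1) / K * math.exp(log_product)


def compositions(total, parts):
    """Yield every ordered tuple of `parts` positive integers summing to `total`."""
    if parts == 1:
        yield (total,)
        return
    for first in range(1, total - parts + 2):
        for rest in compositions(total - first, parts - 1):
            yield (first,) + rest


def total_probability(H, K, alpha):
    """Sum of C over all ordered configurations with K lineages (requires H >= K)."""
    return sum(coalescence_probability(c, alpha) for c in compositions(H, K))
```

For a pair of individuals this gives a coalescence probability of $1 - \alpha = 1/q$ and a probability $\alpha$ of remaining distinct.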



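To compute $D_{H,K}$, we first make the definition

$$f(H,K) \equiv \sum_{\{h_j\}}^{\text{constrained}} \prod_{j=1}^{K} \frac{\alpha \, \Gamma(h_j - \alpha)}{\Gamma(h_j + 1) \, \Gamma(1 - \alpha)} ,$$

and note that

$$D_{H,K} = \frac{H}{\alpha K} \, f(H,K) .$$

There is no simple analytic expression for $f(H,K)$. However,
we can define its generating function

$$R_f(z) \equiv \sum_{H=0}^{\infty} f(H,K) \, z^{H} . \tag{78}$$

Note we are summing from $H = 0$: even though for
$H < K$ this is not biologically relevant, it will be useful
formally. Substituting in for $f(H,K)$, we find

$$R_f(z) = \sum_{H=0}^{\infty} \; \sum_{\{h_j\}}^{\text{constrained}} \prod_{j=1}^{K} \frac{\alpha \, \Gamma(h_j - \alpha) \, z^{h_j}}{\Gamma(h_j + 1) \, \Gamma(1 - \alpha)} . \tag{79}$$

Now we have

$$R_f(z) = \sum_{h_1=1}^{\infty} \sum_{h_2=1}^{\infty} \cdots \sum_{h_K=1}^{\infty} \prod_{j=1}^{K} \frac{\alpha \, \Gamma(h_j - \alpha) \, z^{h_j}}{\Gamma(h_j + 1) \, \Gamma(1 - \alpha)} = \left[ \sum_{h=1}^{\infty} \frac{\alpha \, \Gamma(h - \alpha) \, z^{h}}{\Gamma(h + 1) \, \Gamma(1 - \alpha)} \right]^{K} , \tag{80}$$

where we have used the fact that the sums over the different
$h_j$ are now independent. Recognizing the Taylor
series, we have

$$R_f(z) = \left[ 1 - (1 - z)^{\alpha} \right]^{K} , \tag{81}$$

as quoted in the main text. Note we can also plug in
$K = 1$ to recover the result for $D_{H,1}$ quoted in Eq. (14).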
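
The generating function offers a quick way to tabulate these probabilities numerically. The sketch below expands $[1 - (1-z)^{\alpha}]^{K}$ as a power series and reads off the coefficient of $z^{H}$. It assumes the normalization $D_{H,K} = (H/\alpha K)$ times that coefficient, which is what Eq. (15) implies when one factor of $\alpha$ is absorbed into each term of the product. The function names are hypothetical.

```python
import math


def base_series(alpha, order):
    """Coefficients of 1 - (1 - z)^alpha up to and including z^order."""
    coeffs = [0.0] * (order + 1)
    term = 1.0  # coefficient of z^0 in (1 - z)^alpha
    for h in range(1, order + 1):
        term *= (h - 1 - alpha) / h  # ratio of successive binomial-series terms
        coeffs[h] = -term
    return coeffs


def series_power(coeffs, K):
    """Coefficients of the K-th power of a series, truncated at the same order."""
    order = len(coeffs) - 1
    result = [1.0] + [0.0] * order
    for _ in range(K):
        product = [0.0] * (order + 1)
        for i, a in enumerate(result):
            if a == 0.0:
                continue
            for j in range(order + 1 - i):
                product[i + j] += a * coeffs[j]
        result = product
    return result


def d_from_generating_function(H, K, alpha):
    """D_{H,K} from the coefficient of z^H in [1 - (1 - z)^alpha]^K."""
    f_HK = series_power(base_series(alpha, H), K)[H]
    return H / (alpha * K) * f_HK
```

For $K = 1$ this reproduces $\Gamma(H - \alpha)/[\Gamma(H)\Gamma(1 - \alpha)]$, and summing over $K$ from 1 to $H$ gives one.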